ABSTRACT


A  categorisation  of  file  handling  facilities  in  terms  of  levels 
is  proposed  and  illustrated.  This  categorisation  covers  the  span 
from  the  naive  user  to  the  actual  physical  storage.  The  performances 
of  a  variety  of  keyed-access  mechanisms  are  derived  for  both  a  homogeneous 
and  a  two-level  storage  model.  Some  environmental  requirements  on  file 
handling  characteristics  are  considered.  A  number  of  technical  factors 
of particular importance in filing systems are discussed. A data management
system is described and analysed in some detail. Project organisation
is discussed with particular emphasis on the project initiation
stage. 


PREFACE 


These  notes  were  written  by  the  author  in  preparation  for  a  new 
graduate  course  in  File  Organisation  and  Structure  in  the  Department 
of  Computer  Science  at  the  University  of  Toronto.  Little  exists  in 
the  way  of  established  conceptual  models  for  complete  filing  systems 
covering  the  span  from  the  naive  user  to  the  actual  physical  storage. 

The Codasyl Systems Committee [8] has identified a number of feature
areas in the context of generalized data base management systems, whereas
Madnick and Alsop [32] have proposed a model for the machine-dependent
end of the file handling spectrum. An attempt is made in these notes
to lay the groundwork for such a model based on existing file handling
facilities. It is hoped that this will provide a framework for discussion
and lead to a better understanding of storage and access structures in
general.

An attempt has been made here to use simple mathematics to illustrate
the effect of storage characteristics on file handling performance.
Little theory exists in this area and clearly there is much scope for
research. However, if a file model is to be at all realistic, it must
be based on a realistic storage model. This point of view is central
to these notes and provides a perspective quite different from that
obtained by assuming homogeneous storage.

In  conclusion,  the  contents  of  this  report  should  be  considered 
experimental.  The  author  would  greatly  appreciate  any  comments  on  the 
material covered.


CHAPTER 1


Files:  The  Levels  of  File  Handling  Facilities 


1.1 Introduction


When  different  people  use  the  term  "file",  what  they  mean  depends 
on  their  background.  A  computer  hardware  engineer  often  uses  the  term 
when  referring  to  physical  devices,  such  as  drums.  Low-level  software 
programmers  refer  to  an  entity  which  corresponds  closely  to  the  physical 
device but which is, in fact, formed by the insertion of a simple organising
mechanism between the raw device and the programmer. By the time
we  get  to  the  "non-programming  user",  the  position  has  completely 
altered.  To  such  a  person  a  file  is  commonly  an  object  which  contains 
data  entities,  related  in  some  way,  where  the  entities  can  be  retrieved 
or  modified  through  relatively  simple  requests  to  the  file  facility. 

This  chapter  attempts  to  categorise  all  file  facilities  in  terms 
of  level  of  facility,  i.e.,  higher  level  facilities  are  built  up  from 
lower level facilities. Not all of the levels introduced are necessarily
included explicitly in all file facilities, even those extending as far
as  the  "non-programming  user".  However,  all  file  facilities  are,  in 
some  sense,  describable  in  terms  of  the  levels. 

The  level  approach  has  proved  invaluable  in  many  areas,  such  as 
programming  languages  and  operating  systems.  In  the  author's  opinion 
it  is  likely  to  prove  just  as  valuable  a  concept  in  file  handling. 

The levels which we will present are termed physical files, pseudo-physical
files, pseudo-logical files and user-logical files. They will
be  considered  in  the  following  sections.  First,  however,  it  is  necessary 
to  consider  files  in  general.  We  introduce  the  following  definition  of 
a  file: 

definition:  A  file  is  a  storage  device  together  with  a  set 
of  algorithms  for  altering  the  data  content 
state  of  the  device  and  for  providing  data  (from 
the  device)  to  the  user  for  his  own  purposes. 

"Storage  Device"  is  used  here  in  a  general  sense. 

"User"  is  also  general  and  may  refer  to  an  actual  person  (for  example, 
someone at a terminal) or to a program. An illustration of a file as
defined is given in Fig. 1.1.

Fig. 1.1 General Description of a File


We  now  introduce  certain  properties  of  a  file. 

A file may utilise permanent storage, semi-permanent storage or
temporary storage, where these terms are considered relative to some
unit; e.g., if a file gives up some or all of its storage at the end of
every program execution, then that storage is considered to be temporary.
Storage which is retained, and hence can be assigned to hold the same data
over a number of program executions, is considered to be semi-permanent.
Storage which is retained for an indefinite period is considered to be
permanent.

The above concept leads to the observation that file data may have
a permanent, semi-permanent or temporary lifetime (in the same relative
sense). Some people would use "file" instead of "file data" in this last
statement, but since we have included data manipulation algorithms within
the definition of a file, that usage is not consistent with ours.

We  will  now  give  a  few  examples  of  different  files  which  occur  in 
practice,  and  illustrate  how  they  fit  the  above  scheme. 

Example  1 

A  simple  magnetic  tape  file  conventionally  consists  of  a  permanent 
(or  semi-permanent)  magnetic  tape  "storage  device"  plus  algorithms  to 

(a)  read  next  record 

(b)  write  next  record 

(c)  backspace  a  record 

(d)  rewind  to  beginning  of  tape 

(e) put an end of file data mark on the tape.

The working storage is core store (temporary) and typically contains
the current record read from or to be written to the tape. The user is
a program.
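The tape file of Example 1 can be sketched as a storage device plus its five algorithms. The following is a minimal illustration only: the class name, the use of a Python list for the tape, and the sentinel used for the end of file data mark are all assumptions made for the sketch and do not describe any real tape system.

```python
class TapeFile:
    """Toy model of Example 1: a tape plus its five file algorithms."""

    EOF = object()  # sentinel standing in for the end of file data mark

    def __init__(self):
        self.tape = []      # the "storage device": records in physical order
        self.position = 0   # index of the record currently under the head

    def read_next(self):
        """Return the next record, or None at the end mark or end of tape."""
        if self.position >= len(self.tape) or self.tape[self.position] is self.EOF:
            return None
        record = self.tape[self.position]
        self.position += 1
        return record

    def write_next(self, record):
        """Write at the head position; anything further along the tape is lost."""
        del self.tape[self.position:]
        self.tape.append(record)
        self.position += 1

    def backspace(self):
        """Move back one record (no effect at the beginning of the tape)."""
        self.position = max(0, self.position - 1)

    def rewind(self):
        """Return to the beginning of the tape."""
        self.position = 0

    def write_eof(self):
        """Put an end of file data mark on the tape."""
        self.write_next(self.EOF)
```

The value returned by read_next plays the part of the working storage: it is the one record held in core at a time, while everything else stays on the device.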

Example  2 

A  common  technique  among  programmers  is  to  utilise  part  of  their 
variable  space  as  file  device  storage  during  program  execution.  That 
is,  they  store  data  in  the  area  in  bulk  and  select  portions  of  it  for 
processing  in  certain  working  areas  of  their  variable  space.  All  the 
file  data  thus  has  a  temporary  lifetime  unless  the  programmer  takes 
special  action  to  store  it  in  permanent  storage. 

Sometimes, due to space shortage, temporary file data is stored in
permanent storage (scratch space) and deleted either explicitly or
automatically.

Example  3 

In virtual memory all data is stored in the program variable space.
However, the mechanism holds all the data in permanent storage and retrieves
some of it for temporary core storage under control of an allocation
algorithm. Note that, in this arrangement, the concept of a working area
in variable space may still be utilised. The program thus appears to
operate as in Example 2, except that the restriction on the amount of
data is removed and the data is permanent. Efficiency of operation
is a problem here which we will leave for now but consider later on.

From this last example we can see that it is possible for two levels
of file facility to exist together; i.e., the logical division of a program
variable space into working areas and storage areas, and the physical
division of variable space by a virtual memory mechanism into temporary
working storage (core store) and permanent storage.

Example 4

To a user at a terminal of a simple information retrieval system,
the file storage device is simply a content-addressed device with the
addressing corresponding to a certain data structure. The file algorithms
provide data entities from the device corresponding to user requests
specifying a number of keys. The entities are provided to the user in
a display fashion (i.e., on teletype or picture screen).

In  a  sense  our  definition  of  a  file  looks  remarkably  like  the 
popular  block  diagram  of  a  computer  used  in  introductory  courses.  The 
difference  is  in  orientation,  in  that  the  emphasis  is  on  storage  and 
retrieval  of  data  rather  than  processing.  However,  these  operations  are 
themselves  processes  and  may  be  extremely  complex  if  the  file  storage  is 
highly  structured. 

The basic notion in file data usage is that of operating in a generalised
way on subsets of data obtained from the file; i.e., the operations
are performable on any data subset of a given type. The file algorithms
reference and maintain these data subsets. The file working storage consists
of storage for those algorithms and storage to contain the selected
subsets.

A point which we have not mentioned is that the term file is generally
reserved for dealing with reasonably large quantities of data. It therefore
conventionally excludes such things as the use, by computer control,
of core store as file device storage. In these notes we will also apply
the same restriction, although noting that it is a restriction. People
generally differ regarding what constitutes large amounts of data. We
take the following quantisation from Naur's Datalogy paper [51].

(a) Intermediate quantities of data are such as can be held in (and
are of the order of size of) the homogeneous working store of the computer;

(b) A data quantity is called large if it requires auxiliary storage in
the computer. Note that it is possible to have intermediate sized data
files which must be treated as large simply because they are constrained
to use a small amount of homogeneous storage. Generally, under these
conditions the "homogeneous working storage image" of the file consists of
one or more well defined (in a storage and access sense) data subsets of the
file data, which are intended to be processed as a unit in homogeneous
working storage. We will see this feature illustrated when we look at
files in detail.

1.2 Physical Files

In the context of this scheme, a physical file consists of a physical
device together with a set of machine instructions which permit I/O operations
to be carried out between the device and the computer main memory.

These instructions, or their assembly language equivalents, are rarely
used directly by programmers although some systems programmers have need
of them.

For  example,  in  OS/360  they  are  generally  used  only  by  the  input/output 
supervisor  which  is  a  piece  of  software  providing  a  general  interface  between 
other  software  and  the  devices.  It  is  connected  to  the  devices  via  channels, 
each  channel  usually  servicing  a  number  of  devices. 

One  of  the  tasks  of  the  I/O  supervisor  which  is  worth  mentioning  is 
buffering.  An  area  of  core  store  is  set  aside  for  a  buffer.  Data  is 
passed  from  a  device  to  a  buffer  and  then,  at  a  suitable  moment,  the  data 
is  passed  from  the  buffer  to  the  program  requiring  it.  Thus  the  transfer 
of  data  from  a  device  does  not  have  to  occur  simultaneously  with  the 
request for the data by the program. This feature allows I/O-compute
overlap  for  a  particular  program  and  also  enables  devices  to  be  more  easily 
shared  by  different  programs. 
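The gain from I/O-compute overlap can be estimated with a simple timing model. The sketch below assumes that every block takes the same transfer time and the same compute time, and that a single buffer is filled ahead of the program; the function names are illustrative and the model ignores contention from other programs.

```python
def elapsed_unbuffered_ms(n_blocks, io_ms, compute_ms):
    """Each block is transferred, then processed, strictly in turn."""
    return n_blocks * (io_ms + compute_ms)


def elapsed_overlapped_ms(n_blocks, io_ms, compute_ms):
    """The transfer of block i+1 proceeds while block i is being processed."""
    if n_blocks == 0:
        return 0
    # First transfer and last computation cannot be overlapped with anything;
    # in between, the slower of the two activities sets the pace.
    return io_ms + (n_blocks - 1) * max(io_ms, compute_ms) + compute_ms


print(elapsed_unbuffered_ms(10, 30, 20))   # 500
print(elapsed_overlapped_ms(10, 30, 20))   # 320
```

In this model the overlapped time approaches that of the slower activity alone as the number of blocks grows, so buffering helps most when transfer and compute times are comparable.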

Note that the form of requests for device handling to the I/O supervisor
is different from the basic device machine instructions. We will
not consider this subject in any detail although some coverage is given
in 1.3. Relevant information regarding OS/360 can be found in [1, Chap. 7]
and [15].

In the rest of this section we will look at some physical devices
and their storage media with the intent of abstracting out some important
operational characteristics.

We will concern ourselves with re-usable storage devices only, namely,
drums, discs and magnetic tapes. Detailed information on these devices
is best obtained from actual device specification manuals. Some information
is given in [1, Chap. 8].

(a)  Drums  and  Head  Per  Track  Discs 


With  these  devices  total  access  time  is  approximately  rotational 
delay  time  plus  transfer  time. 

Rotational delay time is the time taken for the position on the track
where the block of data to be accessed resides to rotate round to the
read/write head. Note that there is one head for each track of the
device storage medium.

Transfer  time  is  the  time  to  scan  the  data  block  and  pass  it  to  a 
buffer. 

Average rotational delay time is one-half a revolution of the
(circular) track and is of the order of 10 msecs. In some cases it is
necessary to locate the homing or start position of a track before
locating the actual data block. In this case the extra rotational
delay ranges from zero for a full track block to one-half a revolution
for many blocks per track.
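The access time relation for case (a) can be worked through numerically. The sketch below assumes equal-length blocks that fill the track with no gaps, each block equally likely to be requested, and a revolution time of 20 msecs (chosen only so that the average rotational delay is 10 msecs). Under those assumptions the extra delay after reaching the home position is the mean offset of a block start from home, (b - 1)/(2b) of a revolution for b blocks per track, which gives zero for a full track block and tends to one-half a revolution for many blocks per track.

```python
def drum_access_ms(revolution_ms, blocks_per_track, locate_home=False):
    """Average access time: rotational delay (+ homing delay) + transfer time."""
    rotational_delay = revolution_ms / 2              # half a revolution on average
    transfer = revolution_ms / blocks_per_track       # time to scan one block
    extra = 0.0
    if locate_home:
        # Mean distance from the home position to the start of the wanted block.
        extra = revolution_ms * (blocks_per_track - 1) / (2 * blocks_per_track)
    return rotational_delay + extra + transfer


print(drum_access_ms(20, 4))                     # 15.0
print(drum_access_ms(20, 4, locate_home=True))   # 22.5
print(drum_access_ms(20, 1, locate_home=True))   # 30.0
```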

(b)  Head  Per  Surface  Discs 

In this case one head is shared by a number of tracks, the head
being placed on a movable arm. The arm movement time must thus be
added to the total access time for case (a) to give the total access
time in this case. In a typical case (IBM 2314 disc device) there is
one head per 200 tracks with an average arm movement time of the order
of 60 msecs. Note that the arm time is not linear in the number of
tracks moved. There is an inertia effect resulting in a higher average
time per track moved for small numbers of tracks moved as compared
with large numbers of tracks moved.
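The inertia effect can be illustrated with a deliberately crude model in which arm time is a fixed start and stop cost plus a cost per track. This affine form and both parameter values (5 msecs and 0.5 msecs per track) are assumptions chosen for illustration; they are not measured figures for the 2314 or any other device.

```python
def arm_time_ms(tracks_moved, start_ms=5.0, per_track_ms=0.5):
    """Hypothetical arm movement time: fixed inertia cost plus a per-track cost."""
    if tracks_moved == 0:
        return 0.0
    return start_ms + per_track_ms * tracks_moved


def total_access_ms(tracks_moved, revolution_ms, blocks_per_track):
    """Arm movement added to the case (a) time (rotational delay + transfer)."""
    rotational_delay = revolution_ms / 2
    transfer = revolution_ms / blocks_per_track
    return arm_time_ms(tracks_moved) + rotational_delay + transfer


for d in (1, 10, 100):
    # Time per track moved falls as the move gets longer.
    print(d, arm_time_ms(d) / d)
```

Because the fixed cost is spread over more tracks on a long move, the time per track moved is highest for the shortest moves, which is the qualitative behaviour described above.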

(c)  Magnetic  Tapes 

A magnetic tape has linear as compared to circular tracks. The
inertial time to obtain full tape speed from stopped is of the order
of a few msecs. Transfer time is then typically of the order of 50-100
msec per 6K bytes.


(d)  Direct  Access  Media  Track  Format 


Since  it  has  some  bearing  on  later  work  in  these  notes,  we  will 
look  at  a  typical  track  format  for  discs,  drums,  etc.  Our  example 
is  IBM  360  track  format  [1,  Chap.  9],  [14,  Section  II]  for  IBM  2311 
and  2314  disc  drives.  Each  track  is  divided  into  a  number  (depending 
on  size)  of  data  records.  Each  data  record  consists  of  either  two  or 
three  blocks:  Count-Data  blocks  or  Count-Key-Data  blocks.  The  first 
data  record  on  the  track  is  called  the  Track  Descriptor  record  and 
always consists of Count-Data blocks only. It is termed the zero
record (R0) of the track.

The Count block contains, among other things, an eight byte data
record  address  (cylinder,  head,  record  number),  the  key  length  (zero  if 
no  key)  and  data  length. 

The  key  block  contains  a  key  which  may  be  used  for  keyed  access. 

The  data  block  contains  data.  There  are  a  number  of  data  block  formats 
which  we  will  deal  with  in  1.3. 

Exact  details  of  device  record  addresses  can  be  found  in  [14,  p.84]. 
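The record layout just described can be sketched as a small data structure. The class and field names below are assumptions made for illustration; the sketch records only the fields named in the text and does not reproduce the byte layout of the eight byte address or any other contents of the Count block.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CountBlock:
    cylinder: int
    head: int
    record: int        # record number on the track; 0 is the track descriptor R0
    key_length: int    # zero if the record has no key block
    data_length: int


@dataclass(frozen=True)
class TrackRecord:
    count: CountBlock
    key: bytes         # empty for a Count-Data record
    data: bytes


def make_record(cylinder, head, record, data, key=b""):
    """Build a Count-Data record (no key) or a Count-Key-Data record."""
    if record == 0 and key:
        raise ValueError("R0, the track descriptor record, is Count-Data only")
    count = CountBlock(cylinder, head, record, len(key), len(data))
    return TrackRecord(count, key, data)


track = [
    make_record(3, 7, 0, b"descriptor"),                   # R0
    make_record(3, 7, 1, b"payroll data", key=b"SMITH"),   # keyed record
]
```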

We  will  complete  this  section  with  a  number  of  points  related  to 
the operational characteristics of the physical devices we have considered.

(1)  Access  times  are  typically  of  the  order  of  10  to  100  msecs  for 
accessing  a  block  of  data  from  auxiliary  storage  devices.  This 
should  be  compared  to  core  store  where  access  times  are  of  the 
order of 1 µsec for accessing a storage word.

(2) The following practice can yield substantial gains in efficiency
when working with auxiliary storage. Arrange the contents of
blocks so that, when a block is transferred to core storage, many
accesses are made to the contents before the core area must be
given up for other data. If the contents correspond to executable
instructions, then this is already the case to some degree.
This is why virtual memory obtains some degree of success [52].
If the contents correspond to data, this may or may not be true
and it may be difficult, if not impossible, to utilise the
suggested practice.

(3) Magnetic tapes are not suitable for random access due to (i) their
length and (ii) the time penalty for change of direction. They are
suitable for sequential access, especially where access requests are
sufficiently dense in time to avoid the stopped tape state. Note that
random and sequential refer to the physical ordering of data on
the device.

(4) Since the direct access storage is constantly rotating, the danger
exists that the head is just past the required block when the access
is requested, leading to a whole rotation delay. If this happens
to occur for every access of a series, a serious degradation results.
[1, Appendix B] contains a good article (reprinted from Datamation)
on this point.

(5) On physical sequential access, a head per surface disc is approximately
as fast as drums and head per track discs, arm movement for
one track displacement being a few msecs only. On random physical
access, head per surface discs are much slower than drums and head
per track discs.

(6) Direct access storage is frequently left on-line in an installation
even where removable storage is involved. This means that the
storage is in much greater danger of being overwritten than when it
is set up only for short periods of use. This is an important
factor, especially in these days of third generation machine operating
systems which frequently have few protection facilities and are also
complex for the human operator to handle. For example, it is not
unknown for an operator to take over a private storage media unit
for scratch space and destroy valuable data. It is typically the
case, therefore, that on-line data also exists in an off-line backup
state on such devices as magnetic tapes or data cells.
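Points (2) and (4) can both be made concrete with small counting models. The sketch below is illustrative only. The first function assumes a single core area holding one block at a time and counts how many block transfers a sequence of record references causes. The second assumes consecutive equal-length blocks on one track with no gaps and a fixed program "think" time between requests, and shows how a short pause after each block turns into most of a revolution of waiting. All names and figures are assumptions.

```python
def block_transfers(references, records_per_block):
    """Point (2): transfers needed when core holds only one block at a time."""
    transfers = 0
    in_core = None
    for record_no in references:
        block = record_no // records_per_block
        if block != in_core:        # wanted record is not in the core area
            transfers += 1
            in_core = block
    return transfers


def sequential_read_ms(n_blocks, revolution_ms, blocks_per_track, think_ms):
    """Point (4): time to read consecutive blocks with a pause between requests."""
    transfer = revolution_ms / blocks_per_track
    # After the pause the head has passed the start of the next block, so the
    # request must wait for that start to come round again.
    missed = (-think_ms) % revolution_ms
    return n_blocks * transfer + (n_blocks - 1) * (think_ms + missed)


clustered = list(range(12))
scattered = [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]
print(block_transfers(clustered, 4), block_transfers(scattered, 4))   # 3 12
print(sequential_read_ms(4, 20, 4, 0))   # 20.0
print(sequential_read_ms(4, 20, 4, 1))   # 80.0
```

The same twelve records cost four times as many transfers when referenced in the scattered order, and a pause of 1 msec between requests quadruples the elapsed time for four blocks in this model.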

1.3 Pseudo-Physical Files

These  files  consist  of  physical  files  plus  some  software  and  have 
logical  structures  which  largely  parallel  physical  device  structures.  We 
will  divide  this  type  of  file  into  two  classes,  which  we  shall  call  low 
and  high  level  pseudo-physical  files,  and  which  correspond  to  the  level 
of  the  programming  language  in  which  they  are  usually  utilised.  In  some 
cases  a  particular  type  of  file  structure  appears  on  both  levels.  An 
example  of  this  is  indexed  sequential  which  exists  in  the  assembly  and 
COBOL  languages  for  the  IBM  360.  Where  this  effect  occurs,  some  of  the 
specification  details  are  automated  in  the  high  level  case,  resulting  in 
an  apparently  simpler  and  easier  to  use  facility. 

Another feature which occurs in pseudo-physical files is that high-level
structures may be organised hybrids of several low-level structures.
We will see an example of this in 1.3.2.

1.3.1  Low-level  Pseudo-Physical  Files 

These  files  are  generally  provided  by  what  are  called  the  access 
methods  of  an  operating  system.  The  facility  is  usually  in  the  form 
of  macro  instructions  (but  sometimes  subroutine  packages)  in  the 
assembly  language  of  the  machine.  See  [14,  Section  II]  and  [1,  Chap. 
7  §10]  for  information  on  OS/360.  We  will  use  the  OS/360  case  as  an 
illustration of this level of file.

For OS/360 the access methods communicate with the I/O supervisor
via a standard interface. This is illustrated in Fig. 1.2.


[Figure: only the label "access macro instructions" survives.]

Fig. 1.2 OS/360 Access Method Interfaces


We  will  now  consider  some  specific  access  methods. 

Example  1  -  Sequential  Access  Method  (SAM) 

This  method  has  the  logical  storage  structure  of  a  magnetic 
tape.  That  is,  it  is  a  linear  list  of  records  where  a  record  can 
only  be  accessed  from  its  immediate  neighbours  on  the  list  (except 
for the first record). Accessing a record, therefore, involves
starting  at  the  beginning  of  the  list  (or  current  position)  and 
accessing  the  records  "sequentially"  until  the  required  record 
has  been  accessed. 

It  is  important  to  realize  that  sequential  does  not  imply  any 
ordering.  The  user  of  a  sequential  file  may  or  may  not  wish  to 
utilize  a  content  ordering  rule,  and,  if  required,  it  is  set  up 
externally  to  the  sequential  access  method. 

The actual pseudo-physical ordering of the records is related
to but not necessarily identical to the physical storage order.
To illustrate the relationship we must see how the files are mapped
onto physical devices. SAM data sets (files) may exist on magnetic
tapes or direct access storage media. In the case of magnetic tapes
the pseudo-physical and physical orders are the same.

For direct access storage media the orders may differ. Each
direct access storage volume (e.g., disc pack) contains a "volume
table of contents" (VTOC) which contains one or more "data set
control blocks" (DSCB) for each data set on the volume. An actual
data set is divided up into areas of physically contiguous storage
called extents, with the extents arrayed in some order, not necessarily
corresponding to physical ordering. The information on what
extents a data set has and where they exist is given in the DSCBs
for the data set. The SAM access method thus maps a single pseudo-physical
list of records into one or more physical lists or extents
[14, Section II], [1, Chapter 6].
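The mapping from one pseudo-physical list onto several extents can be sketched as follows. The sketch makes two simplifying assumptions: physical addresses are plain block numbers rather than cylinder, head and record triples, and the extent list is given directly as (first block, number of blocks) pairs in data set order rather than being read from DSCBs. The function name is hypothetical.

```python
def locate(extents, relative_block):
    """Map a block number relative to the data set onto a physical block number.

    extents: list of (first_physical_block, n_blocks) in data set order,
    which need not be the physical order on the volume.
    """
    remaining = relative_block
    for first, length in extents:
        if remaining < length:
            return first + remaining
        remaining -= length          # skip this extent and try the next
    raise IndexError("block lies beyond the end of the data set")


# Three extents whose data set order differs from their physical order.
extents = [(500, 3), (100, 4), (900, 2)]
print([locate(extents, i) for i in range(9)])
# [500, 501, 502, 100, 101, 102, 103, 900, 901]
```

Reading the data set sequentially therefore visits physical blocks 500 to 502, then 100 to 103, then 900 and 901, even though the user sees a single unbroken list.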

Example  2  -  Basic  Direct  Access  Method  (BDAM) 

In this method the system accesses physical blocks by address.
The address may be an actual physical address or an address relative
to the beginning of the data set (in terms of tracks or blocks). In
the second case the VTOC mapping is utilised. A further option
exists whereby, if the keyed track format is utilised, keyed access
is available. Here the search commences at the provided address
and continues until the key is found or determined absent. Conditions
can be placed on the search (see the Macro Instruction
companion to [14]).
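The keyed option amounts to a forward scan from a starting address. The sketch below is an illustration of that idea only, not of the BDAM macro interface: records are held as (key, data) pairs in address order, and the `limit` argument is a hypothetical stand-in for the search conditions, whose actual form is not described in these notes.

```python
def keyed_search(records, start, key, limit=None):
    """Scan from address `start` for `key`; return (address, data) or None.

    records: list of (key, data) pairs in address order.
    limit:   hypothetical bound on the number of records examined.
    """
    end = len(records) if limit is None else min(len(records), start + limit)
    for address in range(start, end):
        record_key, data = records[address]
        if record_key == key:
            return address, data
    return None                      # key determined absent within the range searched


records = [(b"ADAMS", b"a"), (b"BAKER", b"b"), (b"CLARK", b"c"), (b"DAVIS", b"d")]
print(keyed_search(records, 1, b"CLARK"))            # (2, b'c')
print(keyed_search(records, 0, b"DAVIS", limit=2))   # None
```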

Example  3  -  The  Queued  Access  Methods 

With  this  type  of  access  the  macro  instructions  deal  with  records 
rather  than  blocks  and  these  records  are  automatically  amalgamated 
into  blocks  for  I/O  transfer. 

The  method  is  used  only  with  data  sets  that  have  a  sequential 
nature  and  anticipatory  buffering  can  thus  be  automatically  provided. 
In  Example  1,  we  dealt  with  the  SAM  access  method.  In  fact  there 
are  two  versions  of  this  method,  a  basic  version,  BSAM,  and  a  queued 
version,  QSAM. 
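The difference between the basic and queued versions can be pictured as blocking and deblocking done on the program's behalf. In the sketch below the blocking factor, the function names and the use of Python lists for blocks are all assumptions; anticipatory buffering itself is not modelled.

```python
def block_records(records, blocking_factor):
    """Amalgamate records into blocks of at most `blocking_factor` records."""
    return [records[i:i + blocking_factor]
            for i in range(0, len(records), blocking_factor)]


def get_records(blocks):
    """Hand records to the program one at a time; each block is one I/O transfer."""
    for block in blocks:
        for record in block:
            yield record


blocks = block_records(list("ABCDEFG"), 3)
print(blocks)                      # [['A', 'B', 'C'], ['D', 'E', 'F'], ['G']]
print(list(get_records(blocks)))   # ['A', 'B', 'C', 'D', 'E', 'F', 'G']
```

The program sees seven records while the device sees three transfers, which is the saving the queued methods provide for sequential data sets.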

1.3.2  High  Level  Pseudo-Physical  Files 

These are the facilities of a pseudo-physical nature provided in
higher level languages. The access methods of the operating
system are usually utilised in providing these files. To illustrate
the characteristics of this level of structure we take a look
at the file facilities provided in FORTRAN IV (G and H) on OS/360.

Note that the facilities described correspond to Release 17
(approximately) and do not include the "spanned, unspanned" option.
Introductory material on OS/360 can be found in [17]. The unit of
data handled in an organised sense in OS/360 is termed a data set.
A data set is a logically related collection of data which has been
named and described to the system. The data in a data set is
divided into logical units of data called records and into physical
units of data called blocks. Note that a block refers to a
physical area separated from other physical areas by physical
spaces.

If a user wishes to run a program under OS/360, he must
present his program plus some job control cards. All we need
to know here is that one data definition (DD) statement is required
for every data set utilised by the program. In general
the DD statement specifies the name of the data set, its position
in the storage system, its size, certain storage-related properties
and the status (new, old, etc.) of the data set.

Example A - Data Sets in OS/360 FORTRAN IV (G and H)

There are two types of data set in OS/360 FORTRAN IV (G and H):
sequential and direct access. A sequential data set may exist
either on a sequential or a direct access device. A direct access
data set must exist on a direct access device (although the data in
it may be stored on a sequential device while not in use).

The  following  three  definitions  are  required  for  this  example: 

logical record: The amount of data associated with the input/output
                list of a language read or write statement.

FORTRAN record: The amount of data corresponding to one scan of
                a FORMAT statement. Exceptions are:

                (a) a / terminates one FORTRAN record and
                    starts the next.

                (b) the end of the I/O list terminates a
                    FORTRAN record.

record segment: The unit of data into which logical records are
                split. It is a physical quantity in that one
                or more record segments are contained within a
                physical block. It is a logical quantity in that
                in some cases it is associated with a user quantity
                (i.e., FORTRAN record).

Example A.1 Direct Access Data Sets - Language Facilities

Before  a  user  can  access  a  direct  access  data  set,  he  must 
define  it  in  his  program.  This  is  true  regardless  of  whether  the 
data  set  is  already  in  existence  on  a  hardware  device  or  not.  A 
direct  access  data  set  is  defined  by  the  statement 


    DEFINE FILE a(m,r,f,v)

where a is an integer constant specifying the number by which the
        user refers to the data set (data set reference number)

      m is an integer constant specifying the number of record
        segments in the data set

      r is an integer constant specifying the length of a record
        segment (a physical record may contain some meaningless
        filler information). The length is given in units of
        one or four bytes according to the f specification.

      f is one of three options

        (i)   L  data set read or written either with or without
                 format control, r in one byte units

        (ii)  E  data set read or written under format, r in
                 four byte units

        (iii) U  data set read or written without format, r
                 in four byte units

      v is a four byte non-subscripted integer variable called an
        associated variable. This is set by the Fortran software
        to a value referencing the record segment immediately
        following the last transmitted.

The data set must also be specified in a Job Control Language
DD statement with a dd name of the form FTxxF001 where xx is the
data set reference number (a) used by the programmer. The usual
information concerning where the data set exists on backing storage
is required in the DD statement, but information on data set layout
(DCB parameter) should not be specified.

In summary: JCL specifies data set position and size
            User program specifies data set layout and size

From the user aspect a direct access data set consists of fixed
length record segments numbered from 1 up to the total number of
record segments. An input/output statement will always transfer one or
more complete record segments, according to the input/output list,
setting the associated variable to point to the next record segment
after the last one transferred.

When  operating  under  FORMAT,  one  Fortran  record  is  placed  in 
each  record  segment.  When  operating  UNFORMATTED,  record  segments 
are  filled  and  transferred  as  required. 
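The user's view of a direct access data set, including the behaviour of the associated variable, can be sketched as a toy model. This is not the FORTRAN run-time: lengths are in bytes throughout, only the unformatted case (segments filled as required) is modelled, zero bytes stand in for the filler, and the class and method names are assumptions.

```python
class DirectAccessDataSet:
    """Toy model of a data set declared by DEFINE FILE a(m,r,f,v)."""

    def __init__(self, n_segments, segment_length):
        self.segment_length = segment_length
        self.segments = [bytes(segment_length) for _ in range(n_segments)]
        self.associated = 1          # plays the role of the associated variable v

    def _span(self, r, n_bytes):
        """Number of whole segments a transfer needs, checked against the size."""
        n = max(1, -(-n_bytes // self.segment_length))   # ceiling division
        if r < 1 or r + n - 1 > len(self.segments):
            raise IndexError("record segment number out of range")
        return n

    def write(self, r, data):
        """Write starting at record segment r (numbered from 1)."""
        n = self._span(r, len(data))
        size = self.segment_length
        for i in range(n):
            chunk = data[i * size:(i + 1) * size]
            self.segments[r - 1 + i] = chunk.ljust(size, b"\x00")  # pad with filler
        self.associated = r + n      # segment following the last one transferred

    def read(self, r, n_bytes):
        """Read n_bytes starting at record segment r; whole segments are transferred."""
        n = self._span(r, n_bytes)
        self.associated = r + n
        return b"".join(self.segments[r - 1:r - 1 + n])[:n_bytes]
```

Writing 20 bytes at segment 3 of a data set with 8 byte segments occupies segments 3, 4 and 5 and leaves the associated variable at 6, so a program can step through the data set by reusing that value as the next starting segment.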

In Fortran IV G, a direct access data set is mapped onto backing
storage in layout F (see A.3).

Information is input to and output from a data set using the
statements:

    READ (a'r,b,ERR=d) list
    WRITE (a'r,b) list

where a is an unsigned integer constant or integer variable (length
        4 bytes) specifying the data set reference number

      r is an integer expression specifying the relative record
        segment number at which transfer is to start

      b is an optional parameter and is either a FORMAT statement
        number or the name of an array containing a format
        specification

      ERR=d is an optional parameter and d is the statement number to
        which control is passed in the event of an error condition
        during transfer

      list is optional and is an input/output list

Note that the number of record segments transferred always corresponds to the amount of information specified by the input/output list. The FIND (a'r) statement, used before a READ statement referencing the same record segment, enables the information transfer time (not the "seek" time to move the hardware device reading heads) to be overlapped with user program execution.
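The user view of a direct access data set described above can be sketched as a small model. This is only an illustration under stated assumptions: the class name `DirectAccessDataSet`, its method names and the in-memory list standing in for backing storage are hypothetical, and the model follows the unformatted case, in which record segments are filled and transferred as required.

```python
class DirectAccessDataSet:
    """Toy model: fixed length record segments numbered from 1 (assumed names)."""

    def __init__(self, n_segments, segment_size):
        self.segment_size = segment_size
        self.segments = [bytes(segment_size) for _ in range(n_segments)]
        self.associated_variable = 1  # next record segment after a transfer

    def _segments_needed(self, nbytes):
        # A transfer always moves one or more complete record segments.
        return max(1, -(-nbytes // self.segment_size))

    def write(self, r, data):
        """Model of WRITE (a'r) list: start at relative record segment r."""
        count = self._segments_needed(len(data))
        if r < 1 or r - 1 + count > len(self.segments):
            raise IndexError("transfer outside the data set")
        for k in range(count):
            chunk = data[k * self.segment_size:(k + 1) * self.segment_size]
            self.segments[r - 1 + k] = chunk.ljust(self.segment_size, b"\x00")
        self.associated_variable = r + count

    def read(self, r, nbytes):
        """Model of READ (a'r) list: the list length fixes the segment count."""
        count = self._segments_needed(nbytes)
        if r < 1 or r - 1 + count > len(self.segments):
            raise IndexError("transfer outside the data set")
        data = b"".join(self.segments[r - 1:r - 1 + count])
        self.associated_variable = r + count
        return data[:nbytes]


data_set = DirectAccessDataSet(n_segments=10, segment_size=8)
data_set.write(3, b"ABCDEFGHIJ")  # 10 bytes occupy record segments 3 and 4
```

After the write, the associated variable points at record segment 5, the segment following the last one transferred.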

Example A.2 Sequential Data Sets - Language Facilities

A sequential data set is completely defined in Job Control Language. A DD statement is required with a dd name of the form FTxxF00n, where xx is the data set reference number and n is the sequence number for the data set. A number of distinct data sets may exist together as a group with the same data set reference number but distinct sequence numbers. The method of dealing with sequenced data sets is somewhat involved and will not be discussed here. Hence we will consider only the case where n=1 (see Fortran IV (G) Programmer's Guide).

The DD statement for a sequential data set thus specifies both the data set position, in the usual way, and the data set layout and size (in the DCB and SPACE parameters). In Fortran IV G, the layout options (see A.3) for sequential data sets are

for formatted information      F, FB, VB, V, U
for unformatted information    V, VB

In the formatted case, one Fortran record is equivalent to a record segment (apart from the appropriate segment control words and filler information) as described in A.3. A logical record, which is the amount of information corresponding to an input/output statement, is simply a number of Fortran records and no information is held in the data set to specify the logical record lengths or boundaries. Thus logical records being input do not have to correspond to the logical records previously output to the data set.

In the unformatted case, information is held in the data set to specify the logical record lengths and boundaries (see A.4). A logical record is divided into record segments with a maximum size. That is, if a logical record is shorter than the maximum record segment length, then a shorter record segment is formed, whereas, if a logical record is greater than the maximum record segment, then one or more full size record segments, with possibly a shorter last one, are formed. No filler information is present in any unformatted record segment.

Information  is  input  to  and  output  from  a  data  set  using  the 
statements : 

READ  (a,b,END=c,ERR=d) list
WRITE  (a,b) list 

where  a      is an unsigned integer constant or integer variable (length 4 bytes) specifying the data set reference number

       b      is an optional parameter and is either a FORMAT statement number or the name of an array containing a format specification (it may also be a NAMELIST name but we will not consider this option any further; see Fortran IV Language Manual)

       END=c  is an optional parameter and c is the statement number to which control is passed on encountering the end of the data set

       ERR=d  is an optional parameter and d is the statement number to which control is passed in the event of an error condition during transfer

       list   is optional and is an input/output list

Note that the number of record segments transferred always corresponds to the amount of information specified by the input/output list.

There are three more statements associated with sequential data sets:

ENDFILE  a
REWIND  a
BACKSPACE  a

where  a  is an unsigned integer constant or integer variable (length 4 bytes) specifying the data set reference number


ENDFILE    defines the end of the data set

REWIND     causes the next input/output statement to commence operations at the first record of the data set

BACKSPACE  causes the next input/output statement to commence operations as described below

Formatted: F,V,U layouts - positioned at last Fortran record transferred

Unformatted: F,V,U layouts - positioned at last logical record transferred

Formatted and Unformatted: FB,VB layouts - unpredictable results

As can be seen, no information exists in the input/output statements concerning position in the data set. This brings us to the fundamental concept of sequentialism.

A sequential data set is simply a series of records in sequence with the following properties. Input/output always commences at the next record in sequence (a Fortran record if formatted, a logical record if unformatted) unless REWIND or BACKSPACE is used. If a record is written to the middle of a data set, then all following information becomes unattainable.
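The properties of sequentialism just listed can be illustrated with a short model. This is a sketch under stated assumptions, not the behaviour of any particular system: the class `SequentialDataSet` and its methods are hypothetical names, records are held in an in-memory list, and ENDFILE is not modelled.

```python
class SequentialDataSet:
    """Toy model of a sequential data set (assumed names, in-memory storage)."""

    def __init__(self):
        self.records = []
        self.position = 0  # index of the next record in sequence

    def write(self, record):
        # Writing in the middle makes all following records unattainable.
        del self.records[self.position:]
        self.records.append(record)
        self.position += 1

    def read(self):
        if self.position >= len(self.records):
            raise EOFError("end of data set")  # plays the role of END=c
        record = self.records[self.position]
        self.position += 1
        return record

    def rewind(self):
        self.position = 0

    def backspace(self):
        self.position = max(0, self.position - 1)


data_set = SequentialDataSet()
for item in ("r1", "r2", "r3"):
    data_set.write(item)
data_set.rewind()
first = data_set.read()   # the first record written
data_set.write("new")     # replaces r2; r3 can no longer be reached
```

Note that neither `read` nor `write` takes a position argument: the position is implied entirely by the history of operations on the data set.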


Example A.3 System/360 Operating System Facilities

In  this  section  the  term  block  is  used  for  physical  block. 

There are five block layouts. These are set up for the user on the appropriate medium by the operating system, initiated either by Job Control Language statements put in by the user or, in some cases, by automatic default functions for the particular compiler. Thus, for example, the "buffers" mentioned in A.4 are not the concern of the user but are arranged for him.

A layout is specified by the RECFM subparameter of the DCB parameter of the DD statement for the data set.

The five layouts can be described as follows:

A.3.1 RECFM=F

Fixed length blocks, 1 block = 1 record segment, record segment = user data only. Block size specified in JCL by BLKSIZE subparameter of DD statement for data set.

[Diagram: successive blocks, each of length BLKSIZE]


A.3.2 RECFM=FB

Fixed length blocks and record segments, 1 block = n record segments, record segment = user data only.

Block and record segment sizes specified in JCL by BLKSIZE and LRECL subparameters respectively (BLKSIZE = n * LRECL) of DD statement for data set.

[Diagram: successive blocks of length BLKSIZE, each divided into record segments of length LRECL]

Last block of data set may be shorter (still an integral number of LRECLs).

A.3.3 RECFM=V

Variable length blocks and record segments, 1 block = block control word* + 1 record segment, record segment = segment control word* + user data.

Maximum block and record segment sizes specified in JCL by BLKSIZE and LRECL subparameters respectively (BLKSIZE = 4 + LRECL) of DD statement for data set.

[Diagram: one block consisting of a block control word followed by one record segment]

block size <= BLKSIZE
record segment size <= LRECL

* See note at end of A.3.5

A.3.4 RECFM=VB

Variable length blocks and record segments, 1 block = block control word* + n record segments, record segment = segment control word* + user data.

Maximum block and record segment sizes specified in JCL by BLKSIZE and LRECL subparameters respectively (BLKSIZE = 4 + n * LRECL) of DD statement for data set.

[Diagram: one block consisting of a block control word (BCW) followed by record segments, each headed by a segment control word (SCW)]

BLKSIZE-LRECL < block size <= BLKSIZE
record segment size <= LRECL

N.B. Record segments are packed into a block until the remaining space is less than the maximum record segment size allowed. The one exception to the minimum block size is the last block written in a job step.

A.3.5 RECFM=U

Undefined length blocks, 1 block = 1 record segment, record segment = user data only.

Maximum block size specified in JCL by BLKSIZE subparameter of DD statement for data set.

[Diagram: one block consisting of a single record segment]

block size <= BLKSIZE

NOTE: The block control word consists of 4 bytes, of which the first two contain the block length and the last two are set to zero (reserved for system use). The segment control word consists of 4 bytes, of which the first two contain the record segment length and the last two are set to zero.
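The packing rule for the VB layout, together with the control word format given in the note, can be sketched as follows. This is an illustration only: the function names are hypothetical, and we assume that the stored lengths include the 4 byte control words themselves and are held most significant byte first, neither of which is stated in the text above.

```python
import struct

CONTROL_WORD = 4  # block and segment control words both occupy 4 bytes


def control_word(length):
    # First two bytes hold the length, last two are set to zero.
    # Assumption: big-endian byte order.
    return struct.pack(">HH", length, 0)


def build_vb_blocks(user_data_items, lrecl, blksize):
    """Pack record segments into VB blocks (lengths assumed to include control words)."""
    if blksize < CONTROL_WORD + lrecl:
        raise ValueError("BLKSIZE must be at least 4 + LRECL")
    blocks, body = [], b""
    for data in user_data_items:
        segment = control_word(CONTROL_WORD + len(data)) + data
        if len(segment) > lrecl:
            raise ValueError("record segment longer than LRECL")
        body += segment
        remaining = blksize - CONTROL_WORD - len(body)
        if remaining < lrecl:  # no room left for a maximum size segment
            blocks.append(control_word(CONTROL_WORD + len(body)) + body)
            body = b""
    if body:  # the last block written may be below the minimum size
        blocks.append(control_word(CONTROL_WORD + len(body)) + body)
    return blocks


blocks = build_vb_blocks([b"AAAA", b"BBBBBBBB", b"CC"], lrecl=12, blksize=28)
```

With LRECL = 12 and BLKSIZE = 4 + 2 * 12 = 28, the first two segments fill a block of 24 bytes, which lies between BLKSIZE - LRECL and BLKSIZE, and the third segment forms a short final block.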


Example A.4 The Fortran IV Compiler performs the following specific actions on behalf of the user

A.4.1 RECFM=F, V or U

Language READ or WRITE implies transfer of one or more blocks.

A.4.2 RECFM=FB

A buffer is transferred only after it is filled (except for the last block of a data set). Hence one block may be shared by a number of language read or write statements.

A.4.3 RECFM=VB

A buffer is transferred only after the space remaining in it is insufficient to store a record segment of the maximum specified size. Hence one block may be shared by a number of language read or write statements.


A.4.4 Fortran Unformatted V and VB

The Fortran unformatted (V and VB) write statements insert information into the segment control words of the record segments corresponding to a logical record and use this as a check on reading so that:

(a) reading always commences at the start of a written logical record

(b) reading of more than the relevant logical record causes an error exit but reading of only part of it is allowed.

The information code is inserted in the third byte of each segment control word as follows:

binary code    meaning

0              the only record segment of a logical record
1              the first record segment of a logical record
2              the last record segment of a logical record
3              a record segment which is neither the first nor the last of a logical record
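The use of the information code can be sketched as below. The function names are hypothetical, and, as an assumption not stated in the text, the segment length is taken to include the 4 byte segment control word and to be stored most significant byte first.

```python
import struct

ONLY, FIRST, LAST, MIDDLE = 0, 1, 2, 3  # codes from the table above


def split_logical_record(data, max_user_bytes):
    """Divide a logical record into record segments carrying the code in byte 3."""
    chunks = [data[i:i + max_user_bytes]
              for i in range(0, len(data), max_user_bytes)] or [b""]
    if len(chunks) == 1:
        codes = [ONLY]
    else:
        codes = [FIRST] + [MIDDLE] * (len(chunks) - 2) + [LAST]
    # Segment control word: 2 length bytes, code in third byte, fourth byte zero.
    return [struct.pack(">HBB", 4 + len(chunk), code, 0) + chunk
            for code, chunk in zip(codes, chunks)]


def read_logical_record(segments):
    """Reassemble one logical record; reject a start in mid-record."""
    parts = []
    for segment in segments:
        length, code, _ = struct.unpack(">HBB", segment[:4])
        if not parts and code not in (ONLY, FIRST):
            raise ValueError("not positioned at the start of a logical record")
        parts.append(segment[4:length])
        if code in (ONLY, LAST):
            return b"".join(parts)
    raise ValueError("logical record is incomplete")


segments = split_logical_record(b"abcdefghij", max_user_bytes=4)
```

Here a 10 byte logical record with at most 4 user bytes per segment yields three segments with codes 1, 3 and 2 in that order.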


1.4  Pseudo-Logical  Files 

These files are logically structured in that they show no apparent similarity to physical device structures. They are still not at the user level, in that the facilities provided consist of basic operations on the file without any regard for the user's scheme. We now consider an example of a pseudo-logical file. We have a directed tree where every node has an address (according to its position in the tree) of the form $(n, i_1, i_2, \ldots, i_{n-1})$, where n is the level of the node and the $i_k$ specify the appropriate nodes in the path from the root node to the specified node. For example (3,2,4) is the 4th node in the filial set of the 2nd node in the filial set of the root node. In this scheme the nodes of all filial sets are numbered by integers from one.

The operations which are provided for the file are add, delete, read and write a node. These occur in two modes: (a) root mode, in which the entire address of the node is provided; (b) context mode, in which the current node and the number of the required node in the filial set are provided.

There  is  a  choice  of  ways  of  implementing  such  a  file  in  terms  of, 
say,  pseudo-physical  files  plus  organising  algorithms.  If  context  mode 
is  dominant,  then  the  technique  of  embedding  a  table  of  pointers,  to 
filial  set  members,  in  each  node  is  advantageous.  If  random  access  to 
nodes  is  prevalent,  implying  root  mode,  the  case  for  an  integrated  node 
address  to  storage  location  table  is  strong,  especially  for  trees  with 
many  levels.  In  this  way  direct  access  can  be  made  to  any  node  following 
location table access. In either event some sort of spare space table is required which specifies the available space for new nodes in the pseudo-physical file space utilised.

The  following  step-by-step  breakdowns  of  the  read  and  add  in  context 
mode  are  shown  to  illustrate  the  mechanics  of  our  tree  file.  These  assume 
that  a  current  node  pseudo-physical  file  address  is  kept  in  "current  node 
pointer"  and  that  this  pointer  is  appropriately  updated  by  the  tree  file 
operations.  This  task  is  not  shown  in  the  illustrations.  Checks  for 
errors  are  also  not  shown. 

read node: context

1. pick up current node pointer
2. read current node (pseudo-physical read)
3. access filial set table using number of required node in filial set
4. read required node (pseudo-physical read)

add node: context

1. pick up current node pointer
2. read current node (pseudo-physical read)
3. obtain space from spare space table (leaving table appropriately updated)
4. write new node (pseudo-physical write)
5. update filial set table in current node
6. write current node (pseudo-physical write)
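The two breakdowns above can be sketched in code, with the step numbers carried into the comments. Everything named here is an assumption made for illustration: the class `TreeFile`, the dictionary standing in for the pseudo-physical file, the list used as the spare space table, and the `descend` method, which supplies one possible form of the pointer update that the breakdowns leave out.

```python
import copy


class TreeFile:
    """Toy tree file with a filial set table embedded in each node (assumed design)."""

    def __init__(self, capacity):
        self.storage = {0: {"data": None, "filial": []}}  # root node at address 0
        self.spare = list(range(capacity - 1, 0, -1))     # spare space table
        self.current = 0                                  # current node pointer

    def _read(self, address):          # pseudo-physical read
        return copy.deepcopy(self.storage[address])

    def _write(self, address, node):   # pseudo-physical write
        self.storage[address] = copy.deepcopy(node)

    def read_context(self, number):
        address = self.current                       # 1. pick up pointer
        node = self._read(address)                   # 2. read current node
        child_address = node["filial"][number - 1]   # 3. access filial set table
        return self._read(child_address)["data"]     # 4. read required node

    def add_context(self, data):
        address = self.current                       # 1. pick up pointer
        node = self._read(address)                   # 2. read current node
        new_address = self.spare.pop()               # 3. obtain space
        self._write(new_address, {"data": data, "filial": []})  # 4. write new node
        node["filial"].append(new_address)           # 5. update filial set table
        self._write(address, node)                   # 6. write current node
        return len(node["filial"])                   # number of node in filial set

    def descend(self, number):
        # Assumed form of the current node pointer update not shown in the text.
        self.current = self._read(self.current)["filial"][number - 1]


tree = TreeFile(capacity=8)
tree.add_context("a")
tree.add_context("b")
```

Counting the pseudo-physical operations in each method gives two reads for a context read, and one read plus two writes for a context add.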

Note that factors such as size and variability of size of nodes have been ignored in this treatment and that these factors are important in determining the method of implementation.

All the operations on the tree file dealt with nodes. It would be possible to work in an integrated way on node partitions (increasing the number of explicit operations). In many cases a good alternative is to add a mechanism on to the front of the basic tree file to achieve the same ends. This effectively partitions our problem into treatment of nodes and treatment of node internals. This modularity is useful for modification purposes and also because each module may be utilised with modules other than its pair, with the implied common use of file facilities at a high level within different applications. This is an important point which will be mentioned again.

Another  possible  extension  of  the  tree  file  is  to  associate  keys  with 
the  numbers  of  nodes  in  filial  sets  and  to  provide  a  'key  to  number'  table 
either  within,  or  external  to,  each  node  (but  associated  with  it) .  These 
keys  may  then  have  some  meaning  in  the  user  representation  of  the  system 
although the user may not be aware of their use in the particular tree file considered, since we are dealing with a pseudo-logical file.

1.5  User-Logical  Files

To the "non-programming user" a file is an object consisting of pieces of data with relationships connecting them. In order to have a clear picture of his data, the user draws up a "user representation" which illustrates the data characteristics and relationships and the required operations to be performed. For example, let us look at a "user representation" corresponding to conventional data processing. Here the user data consists of descriptions of a set of similar entities; each entity description we will call a record. We need to inspect each entry in turn to discover a subset with certain properties. Hence we could imagine that the records are "in a queue" and we look at them one at a time in sequence until we get to the end. If the records have unique identifiers, then the user may decide to keep the records in the order of their identifiers (assuming such an order exists). This is a data relationship condition which will almost certainly affect the implemented files, but which is basically a condition in the user representation. How the user representation is mapped onto the pseudo-logical (or other) file level depends on the particular use. There may be (apart from meaning, which is only associated with the user level) a one-to-one correspondence between the user representation data structures and the pseudo-logical data structures from which they are formed. This point is illustrated in the case study which appears later in this section. We will look briefly at some of the characteristics of the user-logical area.

This area is fairly well treated in the literature [1, Chap. 12], [2, Part II, Chapter 5], [4], [8].

A file contains data for a number of entities. For example in a personnel file, the employees are the entities. The data for an entity is often termed a record. The entity data consists of values corresponding to different properties or attributes of the entity. For example in a personnel file each employee has a name, age, employee number, etc. The data for a property of an entity is often termed a field. Since fields correspond to user quantities, they may be fixed or variable in length. We will just note two points relating to implementation at this level.

(i) It is common to put fixed length fields before the variable length fields in a record.

(ii) Variable length fields have their length delimited either through a stored length specification or through the use of special symbols as delimiters.

Note also that sometimes variable length fields are treated in the record as fixed length equal to the maximum length that will be attained.

Fields may be formed into groups where the groups are subentities having some meaning to the user. Examples are: Date into Day, Month, Year; Name into Surname, Initials; Invoice into its constituent parts.

Note that groups may be nested, i.e., a group may consist of fields and other groups.

An  important  concept  which  appears  in  practice  is  Multiplicity  of 
groups  or  fields.  One  record  with  a  particular  group  may  contain  many 
instances  of  that  group.  For  example,  consider  Car  Insurance.  There 
may  be  one  record  per  person.  This  record  may  contain  a  group  called 
accidents.  This  group  will  exist  in  a  record  on  the  basis  of  one 
instance  per  accident.  Note  that  in  some  circumstances  the  number  of 
instances  may  be  well  defined,  in  others  it  may  be  variable. 

We close this section by looking at an example of a user record. We will use this example to illustrate how the user representation may be implemented in two different pseudo-logical representations.

Example Bank Account File

Each record corresponds to a single person's account and is as follows.

Account Record

    account number
    name of holder
        surname
        initials
    deposits
    withdrawals
    balance

Deposits and withdrawals are both groups but for the purposes of this example we do not need to know their structure. We will now outline two implementations of this file.

Implementation one

Each account record is made a single pseudo-logical record accessed by account number. That is, the pseudo-logical record contains account number, name, balance, and all deposits and withdrawals as a unit. Requesting an account record provides the whole record (in core or displayed, whichever is appropriate). The advantage of this is simply the single access required to retrieve the whole record. Disadvantages are (i) records which are very variable in size make storage allocation and maintenance more expensive and involved, and (ii) if only a part of the record is required, the whole record must still be retrieved.

Implementation  two 

Each account record is made into:

(a) one main pseudo-logical record containing account number, name, balance (and maybe how many deposits and withdrawals),

(b) a series of deposit records linked to the main record,

(c) a series of withdrawal records linked to the main record.

We thus have a main file, a deposit file and a withdrawal file.

Advantages of this are (1) all records (except possibly main) are fixed in size, (2) retrieval of parts of the data is more efficient.

Disadvantages are (1) retrieval of the whole is less efficient than in implementation one, (2) additions, deletions and updates are more complex, (3) a mapping mechanism is required to convert from the user representation to the pseudo-logical representation.
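The two implementations can be contrasted in a small sketch. All the data and names below are invented for illustration (the account number, the amounts, the field names and the chaining of deposit and withdrawal records by a `next` link are assumptions); the access counts simply count one access per pseudo-logical record touched.

```python
# Implementation one: one pseudo-logical record per account (sample data assumed).
whole_record_file = {
    1001: {"name": ("Smith", "J."), "balance": 70,
           "deposits": [100, 20], "withdrawals": [50]},
}

# Implementation two: a main file plus linked deposit and withdrawal files.
main_file = {1001: {"name": ("Smith", "J."), "balance": 70,
                    "first_deposit": 0, "first_withdrawal": 0}}
deposit_file = [{"amount": 100, "next": 1}, {"amount": 20, "next": None}]
withdrawal_file = [{"amount": 50, "next": None}]


def follow_chain(linked_file, link):
    """Collect the amounts along a chain, counting one access per record."""
    amounts, accesses = [], 0
    while link is not None:
        record = linked_file[link]
        accesses += 1
        amounts.append(record["amount"])
        link = record["next"]
    return amounts, accesses


def fetch_whole_account(account_number):
    """Mapping mechanism: rebuild the user record from the three files."""
    main = main_file[account_number]
    deposits, d = follow_chain(deposit_file, main["first_deposit"])
    withdrawals, w = follow_chain(withdrawal_file, main["first_withdrawal"])
    record = {"name": main["name"], "balance": main["balance"],
              "deposits": deposits, "withdrawals": withdrawals}
    return record, 1 + d + w


def fetch_balance(account_number):
    """Partial retrieval needs the main record only."""
    return main_file[account_number]["balance"], 1
```

In this sample the whole record costs four accesses under implementation two against one under implementation one, while the balance alone costs a single access to a small fixed size record.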
CHAPTER  2


Directories  and  Keyed-Access  Mechanisms 


2.1  The  Need  for  Locating  Mechanisms 


Suppose we have a file of N entities and wish to inspect only one. We could simply scan an unordered file, but this would take on average N/2 inspections. By keeping the file in some order we may reduce this to approximately $\log_2 N$ inspections using the binary split search technique. [Note that in case 1 searching for n entities requires on average approximately (n+N)/(2n) inspections per entity whereas in case 2 it requires $\log_2 N$ inspections per entity (assuming no attempt to combine individual entity searches)].

If  we  wish  to  add  a  new  entity  to  the  file,  the  positions  are  reversed. 
For  our  unordered  file  we  simply  add  the  entity  to  the  file  at  a  convenient 
place,  i.e.,  the  end.  For  our  ordered  binary  split  search  file,  however, 
we  must  put  the  new  entity  into  the  correct  position.  This  requires  on 
average  moving  N/2  entities. 

In  practice  we  require  reasonable  performance  both  for  inspection 
and  addition  of  entities.  The  techniques  above  do  not  satisfy  this 
need.  Therefore  other  "trade-off"  locating  mechanisms  are  required. 
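The trade-off between inspection and addition can be checked by counting operations directly. The sketch below is illustrative only; the function names are our own and the file is represented by a Python list of keys.

```python
import bisect


def scan_inspections(records, key):
    """Serial scan of an unordered file: count inspections until the key is found."""
    for count, record in enumerate(records, start=1):
        if record == key:
            return count
    raise KeyError(key)


def binary_split_inspections(sorted_records, key):
    """Binary split search of an ordered file: count inspections."""
    low, high, count = 0, len(sorted_records) - 1, 0
    while low <= high:
        mid = (low + high) // 2
        count += 1
        if sorted_records[mid] == key:
            return count
        if sorted_records[mid] < key:
            low = mid + 1
        else:
            high = mid - 1
    raise KeyError(key)


def ordered_insert_moves(sorted_records, key):
    """Insert into an ordered file; return how many entities had to be moved."""
    position = bisect.bisect_left(sorted_records, key)
    moves = len(sorted_records) - position
    sorted_records.insert(position, key)
    return moves


keys = list(range(0, 2048, 2))  # 1,024 ordered keys
average_scan = sum(scan_inspections(keys, k) for k in keys) / len(keys)
worst_binary = max(binary_split_inspections(keys, k) for k in keys)
```

For 1,024 keys the serial scan averages about N/2 inspections while the binary split search never needs more than 11, but an addition near the front of the ordered file moves almost every entity.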

2.2  Single  Key  Access


Let  us  assume  we  are  interested  in  locating  the  data  for  an  entity 
corresponding  to  a  particular  key,  in  the  case  where  each  entity  has  a 
unique  key.  All  the  techniques  which  we  will  consider  can  be  considered 
to  consist  of  two  basic  types  of  operation: 

1. Obtaining a storage address by either (a) direct application of a function to a data value (calculation) or (b) picking it up from where it is stored;

2. Content inspection, i.e., inspecting a record of data to see if it satisfies some specified criterion. Here "record" is used to refer to a stored unit of data in general.

From  these  basic  operations  it  is  useful  to  form  two  higher  level 
operations  which  often  appear  more  explicitly  as  components  of  accessing 
mechanisms.  These  are  search  or  scan,  and  transformation. 

1. Search or scan involves inspecting records one by one until our content inspection criterion is satisfied. This implies knowing how to pick up the "next" record. If the records to be scanned exist in sequential storage "locations" then a form of calculation is normal. If the records to be scanned are non-adjacent then either chaining by address storage or calculation may be employed.

2. Transformation uses a key as input and provides an address as output. This can be carried out by either (a) calculation, or (b) table-look-up in a key-address table (this is a form of search), or a combination of (a) and (b).

We now go on to look at some file organisations keeping the above points in mind. In particular we will look at tables, array addressing, hashing, tree searches and a hybrid of the latter two types.

2.2.1  Tables  (directories) 

A  table  is  a  logical  list  of  key-address  pairs  (in  our  case)  and 
we  assume  serial  table  search  is  performed.  We  can  use  tables  for 
locating  in  several  ways. 

Case  1  -  Complete  Directory 

The table has a single entry for each keyed entity in the file.

Important Points

(i) Total number of accesses is the number of accesses required to locate the entry in the directory plus one access to the data.

(ii) The directory may be kept in a different way from the data. For example we may order the keys in the directory but keep the main file unordered. Hence it is easy to add a new data record and, because directory records are smaller, it is less expensive to add a new key.

We consider the cost of retrieving a data record by key. We require on average to scan half of the directory so that in a random storage environment (all storage accesses having equal cost) we require an average of N/2 + 1 accesses. We assume that the size ratio of a key entry in the directory to a data record equals R, a constant (<<1). Then the number of auxiliary storage accesses is on average R(M/2)+1 as opposed to M/2 for the unordered sequential search file, where $M = N/N_B$ and $N_B$ is the number of data records per auxiliary storage block.

This is a considerable improvement if R << 1. Note that the same order of improvement applies to update of the key directory over update of the data record file.

Note  that  if  we  use  a  binary  split  search  technique  for  our 
directory,  instead  of  serial  search,  then  a  similar  result  applies, 
since,  for  smaller  records  (key  instead  of  data)  the  binary  split 
search  will  locate  to  an  auxiliary  storage  block  sooner  (note  the 
storage  focussing  property  of  the  method) . 

It is possible [1, p.128] to use linked list tables. However, despite the appending advantage gained, inspection becomes slower due to (a) non-applicability of binary split search and (b) physically non-contiguous locations which worsen the auxiliary storage accesses.

Case  2  -  The  Partial  Single  Directory  -  Partial  Search 

The directory in this case contains one entry for a group of entities. The data records are ordered on key and each group entry consists of (a) address of (first record in) group (b) key of last record in group.

To retrieve, one looks up the table to find the group and then searches the group to find the data record. If there are N keys and X per sublist, then there are N/X entries in the directory and the number of accesses is given by

$$\frac{N}{2X} + \frac{X}{2}$$

on average in random storage. This is a minimum when

$$X = \sqrt{N}$$

giving $\sqrt{N}$ for the optimum average number of accesses in random storage obtained by varying X.

Considering the auxiliary storage accesses we have that the average number of these is given by

$$\frac{RN}{2XN_B} + \frac{X}{2N_B}.$$

Differentiating with respect to X we have that this is a minimum when

$$X = \sqrt{RN}.$$

We see therefore that optimum X random >> optimum X auxiliary and that under optimum conditions the average number of auxiliary storage accesses is given by

$$\frac{\sqrt{RN}}{N_B}.$$

Note that at optimum X auxiliary the average number of random accesses is given by $\sqrt{N}\,(1+R)/(2\sqrt{R})$, which is much greater than the optimum $\sqrt{N}$ when R << 1.
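For readers who wish to check the Case 2 results, the intermediate steps can be written out as follows. The function names f and g are introduced here only for convenience; f is the average number of random accesses and g the average number of auxiliary accesses, both as functions of the sublist size X.

```latex
\begin{align*}
f(X) &= \frac{N}{2X} + \frac{X}{2}, &
f'(X) &= -\frac{N}{2X^{2}} + \frac{1}{2} = 0
  \;\Rightarrow\; X = \sqrt{N}, &
f(\sqrt{N}) &= \sqrt{N}, \\[4pt]
g(X) &= \frac{RN}{2XN_B} + \frac{X}{2N_B}, &
g'(X) &= -\frac{RN}{2X^{2}N_B} + \frac{1}{2N_B} = 0
  \;\Rightarrow\; X = \sqrt{RN}, &
g(\sqrt{RN}) &= \frac{\sqrt{RN}}{N_B}, \\[4pt]
f(\sqrt{RN}) &= \frac{N}{2\sqrt{RN}} + \frac{\sqrt{RN}}{2}
  = \frac{\sqrt{N}\,(1+R)}{2\sqrt{R}}.
\end{align*}
```

Both second derivatives are positive for X > 0, so each stationary point is a minimum.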

Case 3 - The Partial Multiple Directory - Partial Search

In  this  case  we  have  a  single  master  directory  which  locates 
to  several  subdirectories,  etc.,  until  we  again  reach  a  list  of 
data  records  to  be  searched.  We  have  replaced  the  simple  directory 
of  Case  2  with  a  directory  hierarchy. 

Let us look at the case where we have just two levels of directories. If there are N entities in the file, then we divide the file into $N/X_1$ lists of $X_1$ elements each and place $N/X_1$ entries in the subdirectory level. We now divide the $N/X_1$ directory entries into $N/(X_1X_2)$ lists of $X_2$ entries and place the $N/(X_1X_2)$ entries in the main directory.

We have, therefore, that the average number of random accesses for a retrieval is given by

$$\frac{N}{2X_1X_2} + \frac{X_2}{2} + \frac{X_1}{2}.$$

Differentiating with respect to $X_1$ and $X_2$ separately gives

$$X_1^2 X_2 = N, \qquad X_2^2 X_1 = N.$$

Hence $X_1 = X_2 = \sqrt[3]{N}$ for optimum average number of random accesses for retrieval and this optimum value is $3\sqrt[3]{N}/2$.

Considering the auxiliary storage accesses we have that the average number of these is given by

$$\frac{RN}{2N_BX_1X_2} + \frac{RX_2}{2N_B} + \frac{X_1}{2N_B}.$$

Differentiating with respect to $X_1$ and $X_2$ separately, we get that the optimum occurs at $X_1 = R\sqrt[3]{N/R}$ and $X_2 = \sqrt[3]{N/R}$ giving an optimum average number of auxiliary accesses of

$$\frac{3R\sqrt[3]{N/R}}{2N_B}.$$

With these values for $X_1$ and $X_2$ the average number of random accesses is given by

$$\sqrt[3]{N/R}\,(1 + R/2).$$
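The auxiliary optimum for the two level case follows from two simultaneous conditions, and the steps are sketched here. We write h for the average number of auxiliary accesses, a name introduced only for this derivation, and we assume its middle term is $RX_2/(2N_B)$, which is the form consistent with the optimum quoted in the text.

```latex
\begin{align*}
h(X_1, X_2) &= \frac{RN}{2N_B X_1 X_2} + \frac{R X_2}{2N_B} + \frac{X_1}{2N_B}, \\[4pt]
\frac{\partial h}{\partial X_1} = 0 &\;\Rightarrow\; X_1^{2} X_2 = RN, \qquad
\frac{\partial h}{\partial X_2} = 0 \;\Rightarrow\; X_1 X_2^{2} = N, \\[4pt]
\frac{X_1}{X_2} = R &\;\Rightarrow\; R X_2^{3} = N
  \;\Rightarrow\; X_2 = \sqrt[3]{N/R}, \quad X_1 = R\,\sqrt[3]{N/R}, \\[4pt]
h_{\min} &= 3 \cdot \frac{R\,\sqrt[3]{N/R}}{2N_B}, \\[4pt]
\frac{N}{2X_1X_2} + \frac{X_2}{2} + \frac{X_1}{2}
  &= \sqrt[3]{N/R}\left(\tfrac{1}{2} + \tfrac{1}{2} + \tfrac{R}{2}\right)
  = \sqrt[3]{N/R}\,\bigl(1 + R/2\bigr).
\end{align*}
```

At the optimum the three terms of h are equal, which is why the minimum is three times any one of them.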

Comparative  Example 

Let  us  now  do  a  comparative  example  which  demonstrates  the  power 
of  the  different  organisations  and  also  the  effects  of  using  both 
random  and  auxiliary  optimum  methods.  To  do  this  we  will  show  the 
resulting  random  and  auxiliary  accesses  required  for  both  optimum 
cases . 

We choose values of N, $N_B$, and R to be

N = 32,768,  $N_B$ = 8,  R = 1/8

The following table shows the relative performances of the chosen cases.

method                                     number of     number of random
                                           auxiliary     (total) accesses
                                           accesses

serial search                              2,048         16,384

table, 1 entry per entity                  257           16,385

auxiliary optimising    1 level table      8             288
                        2 level table      3             68

random optimising       1 level table      12.7          180
                        2 level table      3             48

The  improvement  as  we  go  towards  more  levels  of  reference  is  quite 
clear  in  this  example  for  both  optimising  conditions. 
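The entries in the table can be recomputed from the formulae of Cases 1 to 3. The sketch below does this for N = 32,768, a block size of 8 records and R = 1/8; the function and variable names are our own, and the two level auxiliary formula assumes a middle term of R times X2 over twice the block size. The closed-form expressions give 1.5 and 2.5 auxiliary accesses for the two 2 level rows, where the table lists 3 (the minimum attributed in the text to the block arrangement), and about 181 random accesses for the random optimising 1 level row, where the table lists 180.

```python
import math

N, N_B, R = 32768, 8, 1 / 8


def one_level(x):
    """Return (auxiliary, random) average accesses for a 1 level table."""
    random_accesses = N / (2 * x) + x / 2
    auxiliary_accesses = R * N / (2 * x * N_B) + x / (2 * N_B)
    return auxiliary_accesses, random_accesses


def two_level(x1, x2):
    """Return (auxiliary, random) average accesses for a 2 level table."""
    random_accesses = N / (2 * x1 * x2) + x2 / 2 + x1 / 2
    auxiliary_accesses = (R * N / (2 * N_B * x1 * x2)
                          + R * x2 / (2 * N_B) + x1 / (2 * N_B))
    return auxiliary_accesses, random_accesses


cube_root = (N / R) ** (1 / 3)
rows = {
    "serial search": (N / N_B / 2, N / 2),
    "table, 1 entry per entity": (R * (N / N_B) / 2 + 1, N / 2 + 1),
    "auxiliary optimising, 1 level": one_level(math.sqrt(R * N)),
    "auxiliary optimising, 2 level": two_level(R * cube_root, cube_root),
    "random optimising, 1 level": one_level(math.sqrt(N)),
    "random optimising, 2 level": two_level(N ** (1 / 3), N ** (1 / 3)),
}

for method, (auxiliary, random_total) in rows.items():
    print(f"{method:32s} {auxiliary:10.1f} {random_total:10.1f}")
```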

If  we  look  at  the  1  level  table  case,  we  see  that  the  auxiliary 
optimising  case  has  gained  approximately  5  auxiliary  storage  accesses 
at  the  cost  of  108  extra  random  (non-auxiliary)  accesses. 

If  we  look  at  the  2  level  table  structure  we  see  that  the  auxiliary 
optimising  case  costs  20  extra  random  accesses  and  has  not  gained  any 
auxiliary  storage  accesses.  This  is  because  the  absolute  minimum  of 
auxiliary  accesses  for  this  structure  is  3  considering  our  method  of 


27 


block  arrangement.  This  illustrates  how  a  structure  optimised  for 
a  certain  magnitude  of  problem  can  lose  out  under. less  demanding 
conditions . 


If we assume that an auxiliary access costs 1 msec and a core
access costs 1 μsec we have a gain of approximately 4.9 msec and
a loss of 20 μsec using the auxiliary as compared to the random
optimisation with the 1 and 2 level table structures respectively.

Note  that  with  our  formulae  auxiliary  optimising  gives  better 
performance  with  core-drum,  core-disc,  etc.,  but  worse  performance 
with  core-bulk  core  when  compared  to  random  optimising. 

From the above discussion it is clear that neither the random
nor the auxiliary approximation model can be assumed to give the
best results for all storage pairs. For this reason it is
advantageous to go to a more exact model which we shall call the
random-auxiliary model. In this model we recognise that

(a)  We  have  the  same  number  of  random  accesses  as  for  the 
random  model 

(b)  Some  of  the  random  accesses  incur  an  extra  auxiliary 
access  penalty  over  the  random  core  access  cost 

(c)  In  a  random  access  we  include  a  constant  processing 
overhead  (for  comparisons,  etc.)  to  add  to  the  storage 
access  giving  a  uniform  cost  C 

(d)  We  assume  all  auxiliary  accesses  have  a  constant  cost  A. 


In the following development we consider only the 1 level table.
The function we now wish to minimise is the total cost of the random
and auxiliary accesses, which is given by

$$C\left(\frac{N}{2X} + \frac{X}{2}\right) + A\left(\frac{RN}{2XN_B} + \frac{X}{2N_B}\right).$$

Differentiating with respect to X we obtain optimal X given by

$$X = \sqrt{N}\,\sqrt{\frac{N_B + R(A/C)}{N_B + (A/C)}}.$$

We see that

(a) if $CN_B \ll A$, then $X \to \sqrt{RN}$ (auxiliary model)

(b) if $A = 0$, then $X = \sqrt{N}$ (random model)

It is interesting to consider two arbitrary cases.

Case (i)

$A/C = 8$, $R = 1/8$, $N_B = 8$, $N = 32{,}768$

then the optimum average number of random plus auxiliary accesses
for retrieval occurs when $X = 3\sqrt{N}/4$.

This gives an optimum average cost per retrieval of approximately
10.3A + 188C.

We  let  C=1  cost  unit,  then  A=8  cost  units  and  the  optimum  average 
cost  per  retrieval  is  270  units. 

In comparison the cost in our model using optimum random X is 282
units (12.7A + 180C) whereas the cost using optimum auxiliary X is 352
units (8A + 288C). This case shows random fairly close to
random-auxiliary with auxiliary much worse.

Case  (ii) 


$A/C = 64$, $R = 1/8$, $N_B = 8$, $N = 32{,}768$

then the optimum average number of random plus auxiliary accesses for
retrieval occurs when $X = \sqrt{2N}/3$.

This gives an optimum average cost per retrieval of approximately
8.33A + 234C.

Keeping C = 1 cost unit as in case (i), then A = 64 cost units
and the optimum average cost per retrieval is 767 units. In comparison
the cost in our model using optimum random X is 993 units (12.7A + 180C)
whereas the cost using the optimum auxiliary X is 800 units (8A + 288C).

This shows auxiliary fairly close to random-auxiliary with random
much worse.
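
Both cases can be reproduced numerically. The sketch below is ours and is only illustrative: it assumes the 1 level table access counts N/(2X) + X/2 (random) and RN/(2X N_B) + X/(2 N_B) (auxiliary), which are consistent with the figures quoted above, and the function names are hypothetical.

```python
import math

def optimal_list_size(n, r, nb, a_over_c):
    """List size X minimising C*(random accesses) + A*(auxiliary accesses)."""
    return math.sqrt(n * (nb + r * a_over_c) / (nb + a_over_c))

def retrieval_cost(n, x, r, nb, a, c=1.0):
    """Average retrieval cost for a 1 level table with list size x."""
    random_acc = n / (2 * x) + x / 2
    auxiliary_acc = r * n / (2 * x * nb) + x / (2 * nb)
    return c * random_acc + a * auxiliary_acc

N, NB, R = 32768, 8, 1 / 8

for a in (8, 64):                          # A/C, with C = 1 cost unit
    candidates = {
        "random-auxiliary": optimal_list_size(N, R, NB, a),
        "random": math.sqrt(N),
        "auxiliary": math.sqrt(R * N),
    }
    for model, x in candidates.items():
        cost = retrieval_cost(N, x, R, NB, a)
        print(f"A/C = {a:2d}  {model:17s} X = {x:6.1f}  cost = {cost:6.1f}")
```

Evaluated without intermediate rounding, the costs differ slightly from the quoted ones (for example about 271.5 rather than 270 units in case (i)), but the ranking of the three choices of X is the same.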

It  is  worthwhile  making  a  few  comments  about  the  use  of  this  model. 

The  constant  cost  for  each  access  is  clearly  not  realistic  when 
looking  at  storage  devices.  However,  if  we  consider  a  non-dedicated 
device  in  a  multiprogramming  environment,  we  have  no  way  of  specifying 
individual  costs  (which  vary  with  job  load  as  well  as  device  head 
positions,  etc.).  We  are  thus  obliged  to  work  with  the  average  cost 
as  a  convenient  first  approximation.  The  model  theory  then  implies 
that  we  are  optimising  the  average  cost  rather  than  the  cost  under 
all  conditions. 

In utilising the theory to make decisions the ratio A/C is clearly
required to some degree of accuracy. Since C involves a processing
cost this is not just a question of storage access ratios. For example,
in the retrieval method we have been considering, processing involves
key comparison. (Note that with our equations this assumes the data
record retrieval access also involves a key comparison.) This type
of processing is likely to be fairly easy to cost but will be
dependent to some extent on key length.

It  is  important  to  note  that  we  have  worked  entirely  with  access 
optimising  and  ignored  updating  costs.  Since  updating  is  important, 
we  will  now  attempt  to  cost  updating  with  one  of  our  table  structures. 

Case  4  -  Updating  in  the  Partial  Single  Directory  -  Partial  Search  File 

It is difficult to cost updating in general and so we assume
certain operating conditions which simplify our calculations. We
assume that we have a space trade-off allowance for each set of X
records which allows us to add a percentage of X to each list
without requiring more space (locally) for the extended list. We assume
that the key table, considered as a whole, can also be extended in
locally available storage.

To add a single entity to the file involves scanning on average
half the key table plus the operation of adding the new data record.
When we find the key in the table which is greater than the new key,
we add the new record to the corresponding data list. This
implies no change to the key table. The new data record will on
average be inserted half way down the data list. This implies, on
average, that all data records in the list will be read and one-half
will be written. Thus with the random access model we have that the
number of random accesses is given by

$$\frac{N}{2X} + \frac{3X}{2} + 1.$$

This is optimum when $X = \sqrt{N/3}$, which gives the optimum average
number of random accesses to add a new entity (under the condition
that all list sizes equal X) to be

$$\sqrt{3N} + 1.$$

With the auxiliary access model we have that the number of auxiliary
accesses is given by

$$\frac{RN}{2XN_B} + \frac{3X/2 + 1}{N_B},$$

which is optimum when $X = \sqrt{RN/3}$, giving the optimum average
number of auxiliary accesses to add a new entity (under the condition
that all list sizes equal X) to be

$$\frac{\sqrt{3RN} + 1}{N_B}.$$

From these expressions we see that single update under suitably
controlled storage conditions has the same order of magnitude cost
as single retrieval. We also see that the optimum X values for
retrieval and update are appreciably different.
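
A small numerical check makes the difference between the retrieval and update optima concrete. The sketch below is illustrative only: the function names are ours, and the update cost is read as half the key table scanned, the whole list read, half the list written and the new record written.

```python
import math

def update_random(n, x):
    """Random accesses to add one entity to a file with list size x."""
    return n / (2 * x) + 3 * x / 2 + 1

def update_auxiliary(n, x, r, nb):
    """Auxiliary accesses to add one entity to a file with list size x."""
    return r * n / (2 * x * nb) + (3 * x / 2 + 1) / nb

N, NB, R = 32768, 8, 1 / 8

x_random = math.sqrt(N / 3)            # update optimum, random model
x_auxiliary = math.sqrt(R * N / 3)     # update optimum, auxiliary model

print(f"random model:    update X = {x_random:6.1f} (retrieval X = {math.sqrt(N):6.1f}), "
      f"accesses = {update_random(N, x_random):6.1f}")
print(f"auxiliary model: update X = {x_auxiliary:6.1f} (retrieval X = {math.sqrt(R * N):6.1f}), "
      f"accesses = {update_auxiliary(N, x_auxiliary, R, NB):6.2f}")
```

For the example parameters used earlier, the update-optimal list sizes are a little over half the retrieval-optimal ones, so a file tuned for retrieval is noticeably off optimum for single updates.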

Clearly  a  time  comes  when  a  file  either  requires  more  space  for 
a  list  or  has  deviated  sufficiently  from  optimum.  Since,  with  our 
update  policy,  the  data  lists  are  still  in  order,  a  single  pass  is 
sufficient  to  regroup.  It  is  possible  that  some  lists  require  only 
reading  (not  writing) .  However,  it  is  unlikely  that  many  lists  will 
have  been  unaffected  by  say  a  10  per  cent  increase  in  size  on  our 
original  file  size. 

We  assume  that  a  pass  requires  read  and  write  of  all  lists.  As 
we  pass  through  the  list,  we  create  an  entirely  new  key  table  which 
requires  only  writing  (no  reading)  for  its  production. 

Let the new file size be $N_1$ entities and the new list size be $X_1$.
Then the number of random accesses required for gross reorganisation
is given by

$$\frac{N_1}{X_1} + 2N_1.$$

Inserting the optimum random update value for $X_1$ we obtain

$$\sqrt{3N_1} + 2N_1.$$

The number of auxiliary accesses required for gross reorganisation
is given by

$$\frac{RN_1}{X_1 N_B} + \frac{2N_1}{N_B}.$$

Inserting the optimum auxiliary update value for $X_1$ we obtain

$$\frac{\sqrt{3RN_1}}{N_B} + \frac{2N_1}{N_B}.$$


If we look at the above formulae expressed in terms of $X_1$, then,
for both the random and auxiliary cases the minimum cost is attained
when $X_1 = N_1$ (the maximum possible value). This is equivalent to a
single entry key table situation, thus demonstrating the cost of building
key tables. This observation carries over to structures in
general (obvious?). If we wish for powerful retrieval mechanisms, we
must pay to build them.

We assume that I is the increase factor we allow in file size
before reorganising. That is, if a reorganisation occurs at size N
entities, then the next reorganisation occurs at N + NI entities.

We, therefore, take the reorganisation cost, average it over the
added NI entities and add this average cost to the previously
determined average update cost.

We thus have an optimum average number of random accesses per
update including reorganisation given by

$$\sqrt{3N} + 3 + \frac{2}{I} + \frac{\sqrt{3}\,\sqrt{1+I}}{I\sqrt{N}}.$$

The optimum average number of auxiliary accesses per update including
reorganisation is given by

$$\frac{\sqrt{3RN} + 3}{N_B} + \frac{2}{I N_B} + \frac{\sqrt{3}\,\sqrt{R}\,\sqrt{1+I}}{I N_B \sqrt{N}}.$$

Let us consider the case when $R = 1/8$, $N = 32{,}768$, $N_B = 8$, $I = 0.1$.
Then the optimum average number of random accesses is 337 (compared to
314 without reorganisation) and the optimum average number of auxiliary
accesses is 16.6 (compared to 13.9 without reorganisation). Both these
figures are approximately twice the optimum average retrieval figures.
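
The averaging step can be followed numerically. The sketch below is ours and rests on a stated reading of the text: the reorganisation takes place at size N(1 + I), its cost is spread over the NI added entities, and the optimum update list sizes are used throughout. The function name is hypothetical.

```python
import math

def update_with_reorganisation(n, r, nb, i):
    """Return (random, auxiliary) accesses per update, including reorganisation."""
    n1 = n * (1 + i)                                   # file size at reorganisation
    added = n * i                                      # updates sharing the cost

    random_update = math.sqrt(3 * n) + 1
    random_reorg = math.sqrt(3 * n1) + 2 * n1

    auxiliary_update = (math.sqrt(3 * r * n) + 1) / nb
    auxiliary_reorg = (math.sqrt(3 * r * n1) + 2 * n1) / nb

    return (random_update + random_reorg / added,
            auxiliary_update + auxiliary_reorg / added)

random_acc, auxiliary_acc = update_with_reorganisation(32768, 1 / 8, 8, 0.1)
print(f"random {random_acc:.1f}, auxiliary {auxiliary_acc:.1f}")
```

On this reading the random figure agrees with the quoted 337 after rounding, and the auxiliary figure comes out close to, though slightly above, the quoted 16.6.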

For comparison's sake let us now consider batch updating N/10 new
entities to the N size file. For this we require one pass through
the updates, the old file and the new file, plus recreating the key
table. This will give us the same values as for reorganisation.
However, we have assumed that our update file is in order, so we should
add the ordering cost. This latter cost depends on the ordering method
chosen.

If we assume an "$N^2$" ordering method such as the minimum-in-pass
technique then, depending on the number of interchanges required, we
get figures of approximately 1638 to 3276 random accesses and 205 to
410 auxiliary accesses per entity in the ordering. In considering
these figures the following points should be noted.

(i)  A  random  access  corresponds  to  the  accessing  and  comparing 
of  two  keys.  This  is  effectively  equivalent  to  the  random 
access  unit  used  in  the  random  and  random-auxiliary  models 
(see  the  comments  regarding  the  A/C  ratio  of  the  latter 
model) . 

(ii)  In  the  auxiliary  case  calculations  a  core  area  of  two 
blocks  to  hold  the  data  has  been  assumed. 

Combining these ordering values with our batch-update cost we obtain
an average update cost per entity of 1666 to 3299 random accesses and
207.7 to 412.7 auxiliary accesses.

On the other hand, if we assume an "$N \log_2 N$" ordering method
such as a two-way sort-merge we obtain an average update cost per
entity of 35 to 47 random accesses and 4.2 to 5.7 auxiliary accesses.

These last values show a clear cut advantage over "on-line"
update. This illustrates also the need for care with ordering
techniques and the possible cost they can incur.

In looking at "on-line" update we utilised a spare storage
technique to enable us to avoid "movement of record" problems, hence making
the formulae easier to derive and also involving us in a less expensive
environment time-wise. This is an example of the use of overflow
storage [1, Chap. 6]. We use global and local storage slightly
differently from the reference in that we use them in a relative sense.

Overflow storage is local to a unit if it is exclusively for that
unit.

Overflow storage is global to a unit if it is shared by other
units.

In our example, the overflow is local to the key table and to
each sublist. We could, for example, have had one global overflow
for all units with separation and retrieval via chained scan. Note
that although chained scan would not affect the random model figures,
it would adversely affect the auxiliary (and random-auxiliary) model
figures, due to the greater spread of a chain through many auxiliary
storage records.

We close this section with a few points on the index sequential
file organisation, which is an implementation of a table access file.

Points

1. "keyed access" utilising indexes (tables)

2. indexes tailored to disc storage - cylinder and track indexes

3. track scan for keys using count-key-data track format

4. a portion of each cylinder kept for overflow from rest of
   cylinder

5. chained overflow for each track with content ordering through
   track and overflow

6. index record contains key of highest record on track and key
   of highest record for track in overflow

7. sorted file must be provided at initialisation

8. overflow is unblocked, i.e., one record per block

Remarks

1. Point 8 leads to slow random retrieval under moderate to heavy
   update conditions.

2. Must retrieve indexes for random retrievals. This can be
   avoided if plenty of buffer space is available for the
   file by keeping indexes in the buffers.

3. Sequential access is fairly fast since it is possible to
   handle each group of records per index entry purely
   sequentially.


2.2.2  Arrays 

This  assumes  that  we  are  able  to  convert  our  set  of  possible 
identity  keys  by  one-to-one  correspondence  into  addresses  of  a 
rectangular  array  such  that  a  dense  structure  is  formed  by  the  actual 
keys.  This  occurs  mainly  when  the  user  is  able  to  choose  his  keys 
to  ensure  this.  When  this  is  possible,  then  the  technique  is 
extremely  fast,  providing  effectively  exact  location  on  the  basis  of  a 
single  relatively  simple  calculation.  Very  often  in  practice  this 
technique  is  combined  with  other  techniques  such  as  hashing.  A  good 
example  of  this  appears  in  [54] . 

2.2.3  Hashing 


This is where we do a calculation as above but may have more than
one key associated with a given address and have little knowledge of
the distribution of keys into addresses. It is therefore necessary
to have a mechanism to deal with this occurrence. The two conventional
methods of dealing with this correspond to local and global overflow
for each "hash address" in a logical sense [36], [48], [50]. In the
local case, colliding entries are chained together in a list utilising
an overflow area separate from the hash address area (see multilist
analogy later), i.e., each hash address generates an exclusive list
(although normally a common global storage space is assigned). In
the global case the collision involves looking for an available entry
slot by some method where the empty slot is also in the hash address
space. Hence any collision entry may go into any empty slot in
principle. The empty slots are usually found by employing a
calculation scan technique,

e.g., 1. if collision, offset by 1

e.g., 2. if collision, offset by f(hash address) or
         possibly f(key)
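
The two collision policies can be sketched as follows. This is a minimal Python illustration under our own assumptions, not a description of any particular system: the class names are hypothetical, Python's built-in `hash` stands in for the hash calculation, and a prime table size is assumed so that any offset visits every slot.

```python
class ChainedTable:
    """Local overflow: each hash address owns an exclusive chain of colliding keys."""

    def __init__(self, size):
        self.chains = [[] for _ in range(size)]

    def insert(self, key):
        chain = self.chains[hash(key) % len(self.chains)]
        if key not in chain:
            chain.append(key)

    def probes(self, key):
        """Entries inspected to find key, or None if it is absent."""
        chain = self.chains[hash(key) % len(self.chains)]
        return chain.index(key) + 1 if key in chain else None


class OpenTable:
    """Global overflow: a colliding key takes an empty slot in the hash address space."""

    def __init__(self, size, offset=lambda key: 1):
        self.slots = [None] * size      # size assumed prime (see lead-in)
        self.offset = offset            # offset by 1, or by some f(key)

    def _scan(self, key):
        size = len(self.slots)
        position = hash(key) % size
        step = self.offset(key) % size or 1
        for _ in range(size):
            yield position
            position = (position + step) % size

    def insert(self, key):
        for position in self._scan(key):
            if self.slots[position] in (None, key):
                self.slots[position] = key
                return position
        raise OverflowError("no empty slot")

    def probes(self, key):
        """Slots inspected to find key, or None if it is absent."""
        for count, position in enumerate(self._scan(key), start=1):
            if self.slots[position] is None:
                return None
            if self.slots[position] == key:
                return count
        return None
```

Passing `offset=lambda key: 1 + key % 5` to `OpenTable` gives the second scan rule, where keys that collide at the same address follow different scan sequences.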

This technique gives good results for N reasonably small and the
storage area up to 80 or 90% utilised. If N is large, the following
problems may arise.

(1) clustering due to non-randomness of hash addresses

(2) if we try to remove this by increasing the number of
    available hash addresses, this leads to sparse
    storage which for large N is unacceptable

(3) long chains through clustering imply many auxiliary
    accesses.

Thought - The following thought is worth consideration.

By hashing into a small hash address space in the large N case, thus
practically ensuring clusters at all addresses, each hash address
may correspond to one (or more) exclusive auxiliary storage areas
and, even with chaining, auxiliary accesses will be minimised. It
is possible also to have another hash within each local area to cut
down the chain lengths to more reasonable sizes. We will come back
to this point later on in the notes.
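
A minimal sketch of this idea in Python is given below. It is our illustration only: the class name is hypothetical, Python's built-in `hash` stands in for both hash calculations, and each outer hash address is assumed to map to a single auxiliary storage area that is fetched in one access.

```python
class TwoLevelHashFile:
    """A small outer hash picks an auxiliary storage area;
    an inner hash picks a short chain inside that area."""

    def __init__(self, n_areas, chains_per_area):
        self.n_areas = n_areas
        self.chains_per_area = chains_per_area
        self.areas = [[[] for _ in range(chains_per_area)]
                      for _ in range(n_areas)]

    def _address(self, key):
        h = hash(key)
        return h % self.n_areas, (h // self.n_areas) % self.chains_per_area

    def insert(self, key):
        area, chain = self._address(key)
        if key not in self.areas[area][chain]:
            self.areas[area][chain].append(key)

    def find(self, key):
        """Return (auxiliary accesses, comparisons) for a stored key, else None."""
        area, chain = self._address(key)
        entries = self.areas[area][chain]
        if key not in entries:
            return None
        # One area is fetched (assumed one auxiliary access); the chain
        # scan then takes place entirely within that area.
        return 1, entries.index(key) + 1
```

With 1000 integer keys spread over 4 areas of 8 chains each, every retrieval costs one assumed auxiliary access and a scan of a chain of roughly 30 entries, instead of a chain that wanders across many auxiliary records.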

2.2.4  Trees 

To some extent a two-level hashing facility can be thought of as
a tree structure. However, in this section we are concerned with
tree structures where each node contains key data to be matched against
input key data in order to determine which branch of the node to follow
in a search process.

The classical example of this is the binary search tree [26], [36],
where one entire key is held in each node and used as follows

input key < node key → left branch

input key > node key → right branch

input key = node key → match, no further searching.

New keys are stored at the first vacant node reached. With a balanced
tree (all terminal nodes equidistant from the root node) a $\log_2 N$
retrieval is obtained. Note however, that a chained list is a special
case of the binary tree, and it has order N retrieval.
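
The branching rule and the two extreme shapes can be illustrated with a short Python sketch. The names used here are ours, and the sketch makes no attempt at balancing.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Store key at the first vacant node reached; return the root."""
    if root is None:
        return Node(key)
    node = root
    while key != node.key:
        branch = "left" if key < node.key else "right"
        if getattr(node, branch) is None:
            setattr(node, branch, Node(key))
        node = getattr(node, branch)
    return root

def comparisons(root, key):
    """Number of nodes visited to find key, or None if it is absent."""
    node, visited = root, 0
    while node is not None:
        visited += 1
        if key == node.key:
            return visited
        node = node.left if key < node.key else node.right
    return None

def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root

balanced = build([4, 2, 6, 1, 3, 5, 7])   # arrival order gives a balanced tree
chain = build(range(1, 8))                # sorted arrival degenerates to a chained list
```

Retrieving key 7 visits 3 nodes in `balanced` but all 7 nodes in `chain`, which is the order N behaviour noted above.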

Averaging over all shapes, assuming each shape equally likely and
each key equally likely to be retrieved, gives approximately a
$1.4 \log_2 N$ order average retrieval. The orders mentioned here are
all based on random access storage. Some formulae for n-way search
trees are given in a later section regarding their performance for
large N in a two-level storage environment.

2.2.5  Calculated  and  Hybrid  Trees 


In this section we consider tree structures which are more complex
than those we considered previously. A calculated tree is
one whose structure is defined by a calculation technique such as
hashing [33]. For instance we could apply a hash function to a key
and divide the hash result into a sequence of binary digits. Then,
considered from left to right, the sequence of digits could specify
the path in a binary tree from the root to the node associated with
the key. Note that in this type of tree, keys are associated with
terminal nodes only. A good representation of this type of tree is
given in [33] and so we will not include any formalisation here.
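
As an informal illustration only (it is not the representation of [33]), a calculated binary tree can be sketched in Python as follows. SHA-256 is used purely as a stand-in hash function, the names are ours, and keys that share a path are simply kept together at the terminal node.

```python
import hashlib

def hash_bits(key, depth):
    """First `depth` binary digits of a hash of the key (stand-in hash)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bits = "".join(f"{byte:08b}" for byte in digest)
    return bits[:depth]

class CalculatedTree:
    """Binary tree whose paths are calculated; nodes above the terminals hold no keys."""

    def __init__(self, depth):
        self.depth = depth      # number of binary digits used (at most 256 here)
        self.root = {}

    def insert(self, key, record):
        node = self.root
        for bit in hash_bits(key, self.depth):    # "0" = left, "1" = right
            node = node.setdefault(bit, {})
        node.setdefault("keys", {})[key] = record # keys live at terminal nodes only

    def find(self, key):
        node = self.root
        for bit in hash_bits(key, self.depth):
            node = node.get(bit)
            if node is None:
                return None
        return node.get("keys", {}).get(key)
```

No key comparison is made on the way down; the only comparison occurs at the terminal node reached.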

Hybrid trees are trees which consist of partitions having different
characteristics. For example we could have a tree consisting of a
hash tree which locates to groups of keys, together with a binary search
tree for each group so formed.

We will not concern ourselves with the performance of pure
calculated (or hash) trees. Hybrid trees are interesting since
they offer the possibility of utilising different local and global
optimising strategies which can be tailored to the implementation
environment. In the next section we will consider the performance
of various tree structures in a two-level storage environment.

2.2.6  The  Performance  of  Some  Tree  Structures  in  a  Two-Level  Store 

To enable us to measure the performance of structures in a
two-level store we need first to define the properties of our storage
system and next to define the measure, or measures, with which we will
make our comparisons.

We take, as our storage model, the auxiliary model introduced in
section 2.2.1. In this model only the cost to transfer data between
levels is taken into account. We further assume that this cost is
independent of block size.

We consider the performance of static structures. No account
will be taken of the requirements for setting up and maintaining such
structures.

Our performance measure is termed the upper access bound (UAB).
This is defined as the maximum number of slow-level accesses required
to retrieve a key, given in terms of the number of keys contained in
the structure. We introduce two quantities which we will utilise in
deriving expressions for our measure. The node size (NS) is defined
as the amount of storage required by each node of the structure.
The block branching factor (BBF) is defined as the number of pointers
in a block which lead to nodes of the structure existing outside that
block.

In  all  our  measurements  we  assume  that  the  probability  of  a 
search  is  the  same  for  each  key  in  the  structure.  We  will  also  work 
with  balanced  trees  since  these  are  easier  to  treat  and  sufficient 
for  our  purpose. 

2.2.6.1 Key-Node Trees

Each node in an r-way key-node tree consists of r-1 keys, r-1
pointers and r node pointers. The key to be matched is compared with
each key in the node, from left to right say, where the node keys are
ranged according to some ordering rule. If a match is achieved, the
search is complete. If no match occurs, then a node pointer is
followed according to the position of the input key in the ordering of
the node keys:

if $K < K_1$, follow $P_1$

if $K > K_{r-1}$, follow $P_r$

if $K_i < K < K_{i+1}$, follow $P_{i+1}$,

where $K_i$, $i = 1, 2, \ldots, r-1$ are the keys and $P_i$,
$i = 1, 2, \ldots, r$ are the node pointers.
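
The left-to-right comparison within a node can be sketched as follows. The Python below is illustrative: the dictionary layout of a node and the function names are our assumptions, and the r-1 non-node pointers are omitted since they play no part in the search path.

```python
def branch(node_keys, key):
    """Compare key with the node keys from left to right.

    Returns ("match", i) if key equals K_i, otherwise ("follow", i),
    meaning node pointer P_i is to be followed (1-based, as in the text).
    """
    for i, node_key in enumerate(node_keys, start=1):
        if key == node_key:
            return "match", i
        if key < node_key:
            return "follow", i
    return "follow", len(node_keys) + 1

def search(node, key):
    """Number of nodes visited to find key, or None if it is absent."""
    visited = 0
    while node is not None:
        visited += 1
        outcome, i = branch(node["keys"], key)
        if outcome == "match":
            return visited
        node = node["pointers"][i - 1]
    return None

def leaf(*keys):
    """A terminal node: r-1 keys and r empty node pointers."""
    return {"keys": list(keys), "pointers": [None] * (len(keys) + 1)}

# A small 3-way tree: each node holds 2 keys and 3 node pointers.
root = {"keys": [10, 20], "pointers": [leaf(3, 7), leaf(13, 17), leaf(23, 27)]}
```

For instance, `search(root, 17)` follows $P_2$ from the root and matches in the second node visited.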

The size for each node is given by

$$NS = (r-1)L_k + (2r-1)L_p.$$

We assume that we may choose $S_a$ such that

$$\frac{r^j - 1}{r - 1} = \frac{S_a}{NS} = \text{an integer}.$$

This allows us to store the first j complete levels of the tree in a
single block. In this case the block branching factor is given by

$$BBF = r^j = \frac{(r-1)S_a}{NS} + 1.$$

Here we have assumed that the tree has at least j+1 complete levels.

We now define the storage policy according to which each node pointer
making up the BBF value will point to the first node of another block.
We thus form a block tree where the "filial set" size [26] for each
block node is the BBF for that block (which is constant in this case).

Noting that each block contains $(r-1)S_a/NS$ keys, it follows that the
upper access bound for the tree is given by

$$UAB = p \quad \text{for } p = 1, 2, \ldots \quad \text{if } N \le \frac{(r-1)S_a}{NS}\left(1 + \sum_{k=1}^{p-1} BBF^k\right).$$

Note that this upper access bound is of the order of

$$\log_{BBF}\left[\frac{N \cdot NS}{(r-1)S_a}\right].$$
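
A small Python sketch shows how the bound behaves. It is illustrative only: the function names and the numerical values (30-unit keys, 2-unit pointers, 252-unit blocks, r = 2) are our assumptions, chosen so that exactly three complete levels (7 nodes) fit in a block.

```python
def node_size(r, lk, lp):
    """NS = (r-1) Lk + (2r-1) Lp."""
    return (r - 1) * lk + (2 * r - 1) * lp

def key_node_uab(n, r, lk, lp, sa):
    """Smallest p such that n keys fit within p block levels.

    Assumes the block size sa has been chosen, as in the text, so that a
    whole number of complete tree levels fits in each block.
    """
    keys_per_block = (r - 1) * sa // node_size(r, lk, lp)
    bbf = keys_per_block + 1
    p, capacity, power = 1, keys_per_block, 1
    while n > capacity:
        power *= bbf                         # blocks at the next block level
        capacity += keys_per_block * power
        p += 1
    return p

R_WAY, LK, LP, SA = 2, 30, 2, 252            # hypothetical example values

for n in (7, 63, 511, 4095):
    print(n, key_node_uab(n, R_WAY, LK, LP, SA))
```

With these values each block holds 7 keys and has a block branching factor of 8, so each extra block level multiplies the number of keys that can be handled by roughly 8.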

It is interesting to consider the effect of varying r for the
key-node trees. Consideration of the upper access bounds shows that
the performance is a function of positive powers of

$$\frac{S_a}{L_k + L_p\,(2 + 1/(r-1))}.$$

Hence the upper access bound for balanced key-node trees improves
with increasing r. Note that this is the opposite result to that
obtained by considering random access storage. In this case we can
have the best of both worlds by replacing the linear search of each
node by a binary search. In doing this we have taken a binary tree
and implemented it by amalgamating nodes and replacing explicit by
implicit pointers.

It is well known that the binary search tree is a development of
the binary search which has more desirable update capabilities at the
cost of storage size. The binary search key-node trees considered
here represent intermediate stages between the two techniques. The
case r=2 corresponds to the binary search tree, whereas the case
r=N+1 corresponds to the binary search, except for the presence of
unused pointers. It should be noted that the formulae presented here
apply only if $NS \le S_a$, with the best upper access bound performance
obtained when $NS = S_a$.

A further property that appears as expected in the formulae is
that for fixed r the performance decreases with increasing key length.

2.2.6.2 Hybrid Trees

In this section we consider a form of hybrid tree where, if we
imagine a tree of blocks superimposed on the tree of nodes, the
non-terminal blocks consist of calculated (keyless) nodes and the terminal
blocks (BBF=0) consist of key-node trees. In this case we have that
the calculation trees focus down to a block of keys whereas pure
calculation trees focus down to an individual key. The keyless nodes
in the non-terminal blocks consist of r pointers (r-way case) so that
$NS_c = rL_p$, where a subscript c is used to denote those quantities
corresponding to the calculated portions of our hybrid trees. For the
purpose of these derivations we assume that the key nodes in the
terminal blocks are arranged as balanced r-way key-node trees. Although
the internal organisation of the terminal blocks is unimportant in the
derivations, except as regards the number of keys which can be stored
in each block, there are other good reasons for this choice, as will
be seen later.


We thus have that the number of keys per terminal block ($N_t$) is
given by

$$N_t = \frac{(r-1)S_a}{NS}.$$

We assume that we may choose $S_a$ such that

$$\frac{r^k - 1}{r - 1} = \frac{S_a}{NS_c}$$

for some k. Then

$$BBF_c = r^k = \frac{(r-1)S_a}{NS_c} + 1.$$

As for the key-node trees, we may arrange the keyless nodes into
a block tree, with the difference that the number of keys in the
structure is now given by $N_t$ multiplied by the sum of the $BBF_c$ for
all blocks at the last non-terminal block level. It follows that the
upper access bound for the tree is given by

$$UAB = p \quad \text{for } p = 2, 3, \ldots \quad \text{if } N \le N_t\,(BBF_c)^{p-1}.$$

Note that this upper access bound is of the order of

$$\log_{BBF_c}\left[\frac{N}{N_t}\right].$$

It can be shown, as for key-node trees, that the upper access
bound improves with increasing r. It should be noted that the
formulae presented here apply only for $NS_c \le S_a$.


2.2.6.3 A Comparison of the Key-Node Trees and Hybrid Trees

If we consider the order of the upper access bounds, we have that
for r-way key-node trees it is

$$\log_{BBF}\left[\frac{N \cdot NS}{(r-1)S_a}\right],$$

whereas for our hybrid trees it is

$$\log_{BBF_c}\left[\frac{N}{N_t}\right] = \log_{BBF_c}\left[\frac{N \cdot NS}{(r-1)S_a}\right].$$

The only difference in order then is the base of the log function.
This simplicity results from choosing the last block level of the
hybrid trees to contain key-node trees.

It is easily shown that $BBF_c > BBF$ for all $L_k$. This means that
the order of the upper access bound for the hybrid trees is always
better than that for the key-node trees with the same value of r.

For sufficiently small N, key-node trees give a better performance
due to lower terms becoming important. For example it can be seen
that the absolute minimum upper access bound is 1 for the key-node trees
and 2 for the hybrid trees. However, even at only 2 for the upper
access bound, the maximum number of keys which can be handled by
key-node trees is less than that for calculated trees for any realistic
values of $L_k$ and $L_p$. Note that this is true even when the hybrid
binary tree is compared with all r-way key-node trees.
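
The comparison can be made concrete by tabulating the largest number of keys each structure handles for a given upper access bound. The Python below is a sketch under our own assumptions: the function names are hypothetical, and the example values (r = 2, 30-unit keys, 2-unit pointers, 252-unit blocks) are chosen so that both block-filling conditions hold exactly (7 key nodes or 63 keyless nodes per block).

```python
def key_node_capacity(p, r, lk, lp, sa):
    """Largest N with UAB = p for an r-way key-node tree."""
    ns = (r - 1) * lk + (2 * r - 1) * lp
    keys_per_block = (r - 1) * sa // ns
    bbf = keys_per_block + 1
    return keys_per_block * sum(bbf ** k for k in range(p))

def hybrid_capacity(p, r, lk, lp, sa):
    """Largest N with UAB = p for the hybrid tree (p must be at least 2)."""
    if p < 2:
        raise ValueError("the minimum upper access bound of a hybrid tree is 2")
    ns = (r - 1) * lk + (2 * r - 1) * lp
    n_t = (r - 1) * sa // ns                 # keys per terminal block
    ns_c = r * lp                            # size of a keyless node
    bbf_c = (r - 1) * sa // ns_c + 1
    return n_t * bbf_c ** (p - 1)

R_WAY, LK, LP, SA = 2, 30, 2, 252            # hypothetical example values

for p in (2, 3, 4):
    print(p,
          key_node_capacity(p, R_WAY, LK, LP, SA),
          hybrid_capacity(p, R_WAY, LK, LP, SA))
```

For these values the key-node tree handles 63 keys at an upper access bound of 2 and the hybrid tree 448, and the gap widens rapidly with each further block level, in line with the larger base of the logarithm.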

We have demonstrated that the upper access bound performance of
the hybrid r-way trees is superior to that of r-way key-node trees for
large N, although not for small N, in the two-level storage environment
considered. In [33] Coffman and Eve considered two types of tree
which, although not the same as the two types considered here, have
similar node characteristics. They present mean path lengths for
the two types of trees. Although this author has not done detailed
calculations on the trees of Coffman and Eve, the present calculations
would suggest that their trees exhibit the same properties as the two
types considered here, and that their prefix trees would show superior
upper access bound performance to their sequence trees for a
sufficiently large number of keys.

Choosing the key-node trees as the final stage in the hybrid trees
enables us to simplify comparison of the two types of tree. However,
in some sense it also enables us to unify the two types of structure
into a more general structure (we could have let the final stage of
the hybrid trees be a calculated stage and shown similar results to
the ones we obtained). If we let the limit of the structure for small
N be a single block key-node tree, then we have a structure which
possesses the advantages of both types of tree. Also, if this
limiting key-node tree incorporates a balancing mechanism (see [25] for one
such mechanism), then it would be possible to utilize the initial
balanced tree, which contains an input sample of the total set of keys
to be inserted (albeit an uncertain one in terms of representability),
to aid the process of choosing suitable calculation rules for the
keyless nodes of the calculation tree to be formed.


2.3 Multi-Key, Multi-Record Retrieval


Up to now we have considered only the case where a single record is
associated with a single key, for instance in the case of identity keys.

We now wish to look at characteristic (property) keys where

(a) many records are associated with a single key

(b) a total retrieval may be based on a number of
    characteristic keys

(c) a record instance may have simultaneous multiple
    values for a characteristic, i.e., subject characteristic
    as opposed to age characteristic.

2.3.1  Direct  Sequential  Retrieval 

This  is  the  conventional  method  of  retrieval  where  each  data 
record  is  directly  inspected  to  see  if  it  satisfies  the  retrieval 
conditions  (key  values).  All  records  in  the  file  must  be  inspected. 
This method is unsuitable for large files where only a small percentage
of the records is to be retrieved.

2.3.2  Key  Ordered  Sequential  Retrieval 

In this case a decision must be made about the order of importance
of the keyed characteristics. The file is ordered on the primary
characteristic. For each value of the primary characteristic
the records are ordered on a secondary characteristic. This
concept can be continued for all keyed characteristics but in
practice usually stops after two or three.

As long as retrieval values are given for these "order keys",
only a small area of the file need be searched once the start of
the particular primary key value in the file has been determined. With
this approach, batch input, ordered on primary key value, leads to
a single file scan up to the highest retrieval value of the primary
characteristic. If the characteristic is multivalued, then one area of
the file is inspected for each primary characteristic value of a single
request. Having gone to the trouble of arranging the file on an
ordered key basis by characteristic, the obvious thing to do is to
create indexes to fully utilise the arrangement. With the current
scheme an index for the primary characteristic is feasible.

We create a table with one entry for each unique primary
characteristic value, giving the start position of that value in the
file. If we assume a directly addressable file, then we need only
inspect the areas corresponding to the required primary characteristic
values. Note that the table could be set up in various ways,
for example, as a single table or as a multi-directory, partial
search table. In the latter case we could implement the file as
two pseudo-physical files: (a) the table as an index sequential file
providing pointers to (b) the data file as a directly addressed file
with the contents ordered as above.
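The arrangement described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the record layout (primary value, secondary value, payload), the sample data and the names `build_primary_index` and `retrieve` are hypothetical, and an in-memory list stands in for the directly addressable file.

```python
# Hypothetical file: each record is (primary value, secondary value, payload).
# Sorting the tuples orders the file on the primary, then the secondary key.
records = sorted([
    ("cardiology", 34, "r1"),
    ("oncology", 51, "r2"),
    ("cardiology", 60, "r3"),
    ("neurology", 45, "r4"),
    ("oncology", 29, "r5"),
])


def build_primary_index(ordered):
    """One table entry per unique primary value: position of its first record."""
    index = {}
    for pos, rec in enumerate(ordered):
        index.setdefault(rec[0], pos)
    return index


def retrieve(ordered, index, primary, secondary=None):
    """Inspect only the area of the file that holds the given primary value."""
    hits = []
    pos = index.get(primary)
    if pos is None:
        return hits
    while pos < len(ordered) and ordered[pos][0] == primary:
        if secondary is None or ordered[pos][1] == secondary:
            hits.append(ordered[pos])
        pos += 1
    return hits


primary_index = build_primary_index(records)
oncology = retrieve(records, primary_index, "oncology")
```

A request that gives no value for the primary characteristic gains nothing from the table and falls back to a scan of the whole file.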

This technique works reasonably well when one knows the scope
of the requests to the system and when they are fairly standard with
respect to the characteristics (e.g., medical statistics are often
of this nature). However, in a more flexible environment where
requests may involve any of the characteristics in any combination
and where multivalued characteristics are rife (e.g., bibliographic
retrieval), this technique is likely to involve extensive data record
search (as does the normal direct retrieval). If we have no real
primary characteristic that is nearly always utilised in requests,
then ordering the file is, in most cases, worthless.

2.3.3  Set  Representation  of  Multi-Key  Retrieval 

In  the  last  section  we  described  circumstances  under  which  the 
techniques  so  far  described  are  inappropriate.  Before  we  look  at 
some  techniques  which  attempt  to  deal  with  this  environment  we  will 
consider  a  set  representation  of  the  problem  which  is  a  help  in 
analysing  the  position. 

We have a set of N stored data entities

$S \equiv \{ e_i ;\ i = 1, N \}$

where each entity has the same set of associated characteristics
(attributes)

$C \equiv \{ a_j ;\ j = 1, n_c \}$.

If we restrict each entity to having one value for each attribute,
then we have a set of values

$V \equiv \{ v_{ij} ;\ i = 1, N ;\ j = 1, n_c \}$

where $v_{ij}$ is the value for the $j$th attribute of the $i$th entity.

We have arrived at the SHARE file definition (referenced in [4])
if we take the set of triples

$F \equiv \{ (e_i, a_j, v_{ij}) ;\ i = 1, N ;\ j = 1, n_c \}$.

However, we are allowing the possibility that an entity may
have many values for a single attribute. We choose the notation
that $v_{ijk}$ corresponds to the $k$th distinct value (over all entities)
for the $j$th attribute. The $i$ simply denotes a particular
correspondence with an entity. Note that there exists (for given $j$, $k$)
$i_1$, $i_2$ such that $v_{i_1 jk} = v_{i_2 jk}$.

We then have the set of values

$V^m \equiv \{ v_{ijk} ;\ j = 1, n_c ;\ k = 1, m_j ;\ (i = 1, N) \}$

where $m_j$ is the number of distinct values for the $j$th attribute. Our
file definition thus becomes the set of triples

$F^m \equiv \{ (e_i, a_j, v_{ijk}) ;\ i = 1, N ;\ j = 1, n_c ;\ k = 1, m_j \}$.

For each attribute-unique value pair, there exists a subset of S, call
it $S_{jk}$, containing all $e_{i_1}, e_{i_2}, \ldots$ such that
$v_{i_1 jk} = v_{i_2 jk} = \cdots$.

All retrieval sets $S_R$ can be formed from linear combinations of the
$S_{jk}$ subsets.

The operations which combine the $S_{jk}$ are the normal set union,
intersection and exclusion, i.e.,

(A  OR  B) AND  NOT  C 


asks  for  all  elements  which  are  in  A  or  in  B  and  not  in  C. 
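As a small illustration, the request above maps directly onto Python's built-in set operations. The subsets and their contents below are invented for the example; each one stands for the set of entity numbers sharing one attribute-unique value pair.

```python
# Hypothetical subsets, keyed by (attribute, value); members are entity numbers.
S = {
    ("subject", "files"): {1, 2, 5},
    ("subject", "trees"): {2, 3},
    ("language", "french"): {3, 5},
}

A = S[("subject", "files")]
B = S[("subject", "trees")]
C = S[("language", "french")]

# (A OR B) AND NOT C: union, then exclusion.
result = (A | B) - C
```

Here `A | B` is the union, `-` is exclusion, and an intersection would be written `A & B`.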

The request (A OR B) AND NOT C is illustrated by the following figure.
The sets are represented by the interiors of circles, with common
members shown by overlaps.

[Figure: overlapping circles representing the sets A, B and C]

An obvious way then to set up an information retrieval system
based on this type of requirement is to somehow isolate the different
sets $S_{jk}$. The two major ways of doing this are Multilist and
Inverted Lists (with hybrids being possible). We will consider
these techniques in the next sections.

2.3.4  Multilist 


This technique involves the chaining of actual data records.
There is one chain for each attribute-unique value pair (or sometimes
for each unique value over all attributes). Hence the elements
of each $S_{jk}$ are on the $(j,k)$ chain. The start positions for each
chain are held in a directory (which, given $j,k$, provides the start data
record) together with the length of the chain (i.e., the number of
entities in $S_{jk}$). Note that each entity may lie on many chains.

Retrieval is performed by

(1) picking the $S_{jk}$ which is of shortest length out of those
requested

(2) applying all other request conditions to each member of
the chosen $S_{jk}$.

Retrieval is fast if the $S_{jk}$ are small (which usually implies
a small file) but not otherwise [3]. Addition of entities to the
file implies addition of data records to the chains. These
additions are made to the beginnings of the chains, which is an
efficient operation. Note that the actual data record is just added
to the end of the file.
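The multilist directory, chaining and two-step retrieval described above can be sketched as follows. This is an illustrative sketch only: the class `Multilist`, its methods and the record layout are assumptions made for the example, the request is taken to be a pure conjunction of attribute-value pairs, and a Python list stands in for the data record file.

```python
class Multilist:
    def __init__(self):
        self.records = []    # data record file; new records go on the end
        self.directory = {}  # (attribute, value) -> [head record number, chain length]

    def add(self, attributes):
        """attributes maps attribute name -> list of values (multi-valued allowed)."""
        number = len(self.records)
        links = {}
        for attr, values in attributes.items():
            for value in values:
                entry = self.directory.setdefault((attr, value), [None, 0])
                links[(attr, value)] = entry[0]  # old chain head becomes our successor
                entry[0] = number                # new record now heads the chain
                entry[1] += 1
        self.records.append({"attributes": attributes, "links": links})
        return number

    def retrieve(self, wanted):
        """wanted is a list of (attribute, value) pairs, all of which must hold."""
        if any(pair not in self.directory for pair in wanted):
            return []
        # Step (1): pick the shortest of the requested chains.
        shortest = min(wanted, key=lambda pair: self.directory[pair][1])
        hits = []
        number = self.directory[shortest][0]
        while number is not None:
            record = self.records[number]
            # Step (2): apply every request condition to each chain member.
            if all(v in record["attributes"].get(a, []) for a, v in wanted):
                hits.append(number)
            number = record["links"][shortest]
        return hits


m = Multilist()
m.add({"subject": ["files", "trees"], "year": [1970]})
m.add({"subject": ["files"], "year": [1971]})
m.add({"subject": ["trees"], "year": [1971]})
```

Because additions are made at the head of each chain, a chain is traversed from the most recently added record back to the oldest.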


Deletions are more of a problem due to

(a) the hole problem in the data record file

(b) the search problem in the chain. It is necessary to
follow the chains from the beginning to the required
deletion (one-way chain).

With this system we have the option of choosing only certain
attributes for chaining and always employing search for the non-chained
attributes within the appropriate chain. Then, in some cases, requests
could involve direct sequential retrieval of the whole file (if the
request contains only non-chained attributes).

A modification of the multilist structure exists called
"controlled list length multilist", in which each list of the multilist
is replaced by a number of lists with a controlled maximum list
length. The list directory then has the corresponding number of
list heads for each key entry. The advantage of this structure is
that it is possible to retrieve the lists for each key in parallel,
thus providing an improved compute-input/output overlap. Note,
however, that the same number of input/output operations is
required as for the pure multilist.

Another modification of the multilist structure is "cellular
multilist". In this case, each key entry in the directory contains
one sub-entry for each cell containing one or more records on
the list for the key. A cell is a localised storage area or block.
In a retrieval involving several keys, inspection of the appropriate
key entries is sufficient to eliminate from the search a number of
cells not fitting the retrieval requirements. The length of each
list for a key in a cell is stored with the key-cell sub-entry.

Hence,  for  each  cell,  the  shortest  list  can  be  selected  for  the 
search  process.  Note  that  some  cells  selected  in  the  cell  selection 
process  on  the  key  directory  may  result  in  no  records  satisfying  the 
retrieval.  Updating  in  cellular  multilist  is  similar  to  the  pure 
multilist  case  and  we  will  not  consider  it  further. 
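The cell selection step can be illustrated with a short sketch. Everything named here is hypothetical: the directory layout (key to a map of cell number to list length), the function `plan_search` and the sample figures. The request is assumed to be a non-empty conjunction of keys.

```python
# Hypothetical cellular directory:
# key -> {cell number: length of that key's list inside the cell}
directory = {
    ("subject", "files"): {0: 3, 1: 1, 4: 2},
    ("year", 1971): {1: 5, 2: 2, 4: 1},
}


def plan_search(directory, wanted):
    """For a conjunctive request, return {cell: key with the shortest list there}.

    Cells that lack a sub-entry for any one of the wanted keys are eliminated
    by inspecting the directory alone, before any data record is read.
    """
    cells = set.intersection(*(set(directory.get(key, {})) for key in wanted))
    return {
        cell: min(wanted, key=lambda k: directory[k][cell])
        for cell in sorted(cells)
    }


plan = plan_search(directory, [("subject", "files"), ("year", 1971)])
```

A cell that survives this selection may still yield no record, since the two keys can sit on different records within the same cell.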

Inverted lists (which we treat in section 2.3.5) and the various
types of multilist are treated in some detail in [3]. Note that
some errors exist in Table 29 of that reference. In particular, the
retrieval time for cellular serial is approximately 13.6 seconds, not
3.6 seconds (assuming 0^=30 in Table 27, this figure being almost
unreadable).

2.3.5  Inverted  Lists 


As in multilist, the $S_{jk}$ are directly utilised but in this case
we have a more directory-oriented system.

We hold the set of data record addresses for each $S_{jk}$ as a
(logically) contiguous list, so that our directory provides the
corresponding list given the attribute-unique value pair. A retrieval
request thus involves obtaining the appropriate record address lists
and performing the set operations on these lists. Note that the
lists are ordered, since otherwise the set operations become very time
consuming.
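The benefit of keeping the address lists ordered can be seen in a short sketch: with ascending lists, intersection and exclusion each need only one pass over their inputs. The function names and sample addresses are illustrative assumptions, not taken from any particular system.

```python
def intersect(a, b):
    """Addresses present in both ascending lists, found in a single merge pass."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def exclude(a, b):
    """Addresses in ascending list a that do not appear in ascending list b."""
    j = 0
    out = []
    for address in a:
        while j < len(b) and b[j] < address:
            j += 1
        if j == len(b) or b[j] != address:
            out.append(address)
    return out


both = intersect([3, 8, 15, 22], [8, 9, 22, 40])
remaining = exclude([3, 8, 15, 22], [8, 40])
```

With unordered lists each membership test would require a search of the other list, which is the time cost the text warns about.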

A final list results which contains the addresses of all data
records satisfying the request. There are two common arrangements
of the directory: (a) one directory file per attribute, (b) one
directory file for all attributes. In both cases the general "directory
record" format is: identifier (e.g., the attribute-unique value pair in
case (b)), number of addresses in the list, the list of addresses.

This setup gives good access times for large files but updating
is more involved and costly. This mainly comes from (i) the
variable-length directory records, which may individually vary in size
with updating, leading to the normal storage allocation problems, and
(ii) the ordering of the addresses in the lists. Note that, as with
multilist, we may choose to have only some of the attributes inverted;
the final record retrieval stage would then involve a search on each
record retrieved regarding the non-inverted attribute values.

As  is  usual,  we  have  a  choice  of  pseudo-physical  files  in  which 
to  represent  our  directory.  For  example  we  could  utilise  a  table 
structure  such  as  those  we  described  in  2.2.1  where  each  key  would 
be  an  attribute  value  (case  (a)  directory  per  attribute)  and  the  data 
records  are  the  variable  lists  of  addresses. 

This  variability  leads  to  several  more  options  in  the  data 
record  storage.  For  example,  assign  a  number  (varying)  of  fixed 
size  blocks  to  each  list.  The  table  pointers  then  indicate  the 
first  block.  The  subsequent  blocks  may  either  be  contiguous  in 
storage  (implying  more  shifting  at  updates)  or  chained  (implying 
probably  longer  access  times) . 

Returning to the general inverted structure, there is one important
point concerning the lists of record addresses. The option exists
that these refer to actual addresses or to some unique entity
identifier (such as accession number or record number). In the latter case
a mechanism is required to map from identifier to storage address.
However, update problems (due to records shifting their addresses) are
eased, since a shift in a record involves the update of one pointer only,
rather than a (possibly large) number of pointers with unknown positions
in the lists. This use of indirection is particularly important in
file handling, and analogous cases appear in many file handling problems,
especially with highly linked structures. Note that this indirection
can be handled either explicitly (the identifier to storage address
mapping is a separate file) or implicitly (the mapping is internal to
the data record file).
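The effect of indirection on updates can be sketched as below, taking the explicit variant in which the identifier to address mapping is held separately. The class `IndirectFile`, its methods and the sample addresses are hypothetical and serve only to show that moving a record touches one pointer while the inverted lists stay unchanged.

```python
from bisect import insort


class IndirectFile:
    def __init__(self):
        self.address_of = {}  # identifier -> storage address (the separate mapping)
        self.inverted = {}    # (attribute, value) -> ascending list of identifiers

    def store(self, ident, address, pairs):
        """Record a new entity and enter its identifier in each inverted list."""
        self.address_of[ident] = address
        for pair in pairs:
            insort(self.inverted.setdefault(pair, []), ident)

    def move(self, ident, new_address):
        """A record shift updates exactly one pointer; no list is searched."""
        self.address_of[ident] = new_address

    def addresses(self, pair):
        """Resolve an inverted list to current storage addresses."""
        return [self.address_of[i] for i in self.inverted.get(pair, [])]


f = IndirectFile()
f.store(7, 1000, [("subject", "files"), ("year", 1971)])
f.store(3, 2000, [("subject", "files")])
before = f.addresses(("subject", "files"))
f.move(7, 5000)
after = f.addresses(("subject", "files"))
```

The price of the scheme is the extra mapping step on every retrieval, which is the trade-off the text describes.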
CHAPTER 3

Environmental  Requirements 
on  File  Handling  Characteristics 


3.1  Introduction 


Up  to  now,  in  our  development,  we  have  looked  at  our  file  facilities 
from  an  internal  structural  viewpoint.  That  is,  given  a  file  facility 
operation  we  have  broken  it  down  into  the  corresponding  internal  file 
operations.  Using  this  approach  we  have  introduced  fairly  complex  files 
which  we  are  able  (to  some  extent)  to  analyse  as  to  their  performance. 

Now we must turn about face and begin to look at the external environment
in which these files will operate, in order to see what requirements the
environment makes on the file facilities. What we seek here is not exact
rules which say that, for this exactly specified environment, this file
system is optimal (although that would be a worthwhile aim). There are
problems, however, in that (a) we rarely know the exact details of the
environment and (b) what is optimal is open to question, especially when we
consider the human factors of the total problem. We often have a trade-off
between efficiency and facility. What we are looking for here are the
restrictions or bounds forced on the files by the environment, so that we may
say certain file facilities are excluded under certain conditions.

3.2  The  Human  Interface 


We are interested in a computer facility which is utilised directly
by humans. We ignore the indirect human-to-human-to-facility interface
altogether. The basic characteristics of the interaction are user
job-task size (general term) and required response time. A job-task here
is considered to be a work unit for the facility which involves no
human-user interaction. Two examples, from the extremes of what is basically
a continuum, are (a) a large program doing a specified job which is run on
the computer without user interaction and with the results returned in
hours (or days), and (b) a single simple request via a terminal (typewriter)
with the response within a few seconds. In between are such things as
batched small programs returned (within a few minutes?) via a terminal
(card reader-line printer link), e.g., WATFOR. Since we are interested
in file handling problems this latter case is not so relevant, although a
similar requirement could be necessary for small file handling jobs.

One further factor which is often critical is job-task cost, by which
we mean computer resource cost. This relates to the efficiency of
operations, and although we do not necessarily require optimal efficiency
(remember the efficiency-facility trade-off) we certainly require reasonable
(a value judgement) efficiency. We require this regardless of environment,
which is one reason why we treated the internal structural properties of
file facilities separately from the environment. The large off-line
submitted job-task user may be just as concerned about the efficiency of
individual task requests as is the small on-line job-task user, although
in some cases the former requires efficiency averaged over a number of
requests rather than over one (not necessarily the same).
3.3 The Off-Line Large Job-Task Environment

In this case we have our large job-tasks run to completion or error.
We therefore have (except for exceptional cases) no human requirement
for specific request file response. We do have the requirement for
global efficiency. We thus have the possibility of batching our file
requests in such a way that a batch of requests is satisfied by one
efficiently arranged complex request. This type of strategy favours
search as opposed to directory techniques, in that the cost of a batch
may be very little more than the cost of a single request (see 2.1 to
2.2.1, especially the note on ordering cost).

We arrange our requests so as to enable a "sequential path" to be
followed through the file. For large N we may be able to arrange
"sequential paths" through subsections of the file, hence the usefulness
of index sequential type organisations. Note that if any task requires
inspection of all or most of the entities in a file, as for example a
statistics task, then a pure unordered sequential organisation is the
most effective.

Even  with  an  inverted  file  structure  with  separate  indexes  from  the 
data  records,  it  may  be  cheaper  to  batch.  For  example,  the  indexes  are 
inspected for the batch and the sets of records determined for the
responses. The record numbers are then ordered and a single pass made of
the  data  record  file  to  obtain  the  required  data  records,  which  are  then 
reorganised  for  the  responses.  This  type  of  strategy  is  particularly 
suitable  for  a  mixed  device  file  where  direct  access  storage  is  too  small 
to  hold  the  entire  file.  The  indexes  are  held  on  direct  access  and  the 
data  records  on  tape.  The  single  pass  of  the  tape  per  batch  is  then  an 
attractive  feature. 
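The batching strategy for an inverted file can be sketched as follows. This is a hedged illustration: the function `batch_retrieve`, the request names and the sample data are invented, the index stage is assumed to have already produced valid record numbers for each request, and a Python list read front to back stands in for the tape.

```python
def batch_retrieve(data_file, requests):
    """Serve a batch of requests with one forward pass over the data file.

    requests maps a request name to the record numbers found by the index
    stage. data_file is treated as readable only from front to back.
    """
    # Order the record numbers needed by the whole batch (duplicates merged).
    needed = sorted({n for numbers in requests.values() for n in numbers})
    pending = iter(needed)
    target = next(pending, None)
    fetched = {}
    for number, record in enumerate(data_file):  # the single pass
        if target is None:
            break                                # nothing further is wanted
        if number == target:
            fetched[number] = record
            target = next(pending, None)
    # Reorganise the fetched records into one response per request.
    return {name: [fetched[n] for n in numbers] for name, numbers in requests.items()}


tape = ["r0", "r1", "r2", "r3", "r4", "r5"]
responses = batch_retrieve(tape, {"q1": [4, 1], "q2": [1, 5]})
```

Record 1 is wanted by both requests but is read only once, which is where the saving over serving each request separately comes from.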

The  strategy  may  be  suitable  even  for  a  direct  access  file  if  the 
record  numbers  are  allowed  to  match  the  device  track  and  arm  movements 
(under  dedicated  conditions) . 

Some  large  job-tasks  are  unable  to  batch  their  requests.  This  is 
particularly  true  of  highly  linked  file  structures  where  link  tracing  is 
involved.  Under  these  conditions  quick  individual  access  files  may  be 
required. 

One very important characteristic of the usage environment involves
the access-update usage balance. It is this factor which, probably more
than any other, determines which file organisation to utilise.

Some other characteristics of off-line systems are:

(a) Facilities are usually embedded in a general-purpose higher-level
language, since all eventualities must be taken into account
(without intervention actions). There is thus a tendency
towards complex facilities with many options.

An opposing type of facility is sometimes seen where users
submit simple requests (forms or data cards) to the (human)
controller of a multi-user batched system, e.g., an off-line batch
library catalogue system.

(b) Backup is usually handled on a regular-interval complete-dump
basis, with the ability to regain the current position by either (i)
saving of "transactions" or (ii) rerun of tasks. Since response
is usually not vital, this method usually works quite well,
although selective dumping is sometimes employed, e.g., dumping
changed portions only. Note the possible high cost for large
files.

(c) Update is usually batched, sometimes leading to file state phasing
problems, e.g., (i) junk accumulation which is unremovable due
to lack of facilities provided, (ii) problems caused by information
getting out of order in an ordered file.

3.4  The  Interactive  Small  Job-Task  Environment 

Here the emphasis is usually on providing a response to an on-line
user in a small period of time. Hence files are oriented towards quick
access. Since update and access are opposing quantities requiring the
trade-off approach, some on-line systems do their updating off-line.
This is particularly likely with highly structured and indexed files
where updating is an extremely complex task. It is also likely when
obtaining precise up-to-the-minute information is not an essential
requirement. Some systems, such as airline reservation systems, do their
updating on-line, but most updates do not radically affect the indexing
structures, rather the data contents. Complete off-line updating of
on-line systems can lead to the same problems we had in the total off-line
environment. However, there is the possibility of simulated on-line
update in several degrees.

For instance, an on-line update request is held in a separate file
from the main file with either (a) notification of update to new updaters
only, (b) notification of updates to updaters and enquirers, or (c)
co-ordinated access to main and update files, operating like a fully
on-line updated file. In all cases periodic off-line updating is required
to combine the update and main files.
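Option (c) can be illustrated with a small sketch in which every read consults the update file before the main file, and a periodic reorganisation merges the two. The class `OverlaidFile`, its method names and the use of in-memory dictionaries are assumptions made for the example.

```python
class OverlaidFile:
    _DELETED = object()  # marker for a pending deletion in the update file

    def __init__(self, main):
        self.main = dict(main)  # main file: changed only by reorganise()
        self.updates = {}       # separate update file holding on-line changes

    def update(self, key, record):
        self.updates[key] = record

    def delete(self, key):
        self.updates[key] = self._DELETED

    def read(self, key):
        """Co-ordinated access: the update file takes precedence over the main file."""
        if key in self.updates:
            record = self.updates[key]
            return None if record is self._DELETED else record
        return self.main.get(key)

    def reorganise(self):
        """Periodic off-line step combining the update and main files."""
        for key, record in self.updates.items():
            if record is self._DELETED:
                self.main.pop(key, None)
            else:
                self.main[key] = record
        self.updates.clear()


f = OverlaidFile({"a": 1, "b": 2})
f.update("a", 10)
f.delete("b")
seen_before = (f.read("a"), f.read("b"), dict(f.main))
f.reorganise()
```

Until `reorganise` runs the main file is untouched, yet enquirers see the file as though it had been updated on-line.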

With  this  type  of  system  on-line  updaters  can  check  the  successful 
inclusion  of  their  update  although  strictly  speaking,  the  checks  should 
be  repeated  after  each  off-line  reorganisation. 

The above discussion brings out another feature of this type of
environment. There are usually different types of users at different
levels as follows:

(a) Query level. Simple language facilities in an easy to use
form with some prompting.

(b) Update level. Still easy to use. More facilities but still
not general.

(c) Maintenance level. More general, enabling file reorganisation
and systems programming maintenance.

Note  that,  under  conditions  caused  by  many  small  tasks  time-sharing  from 
many  terminals,  a  small  batch  policy  may  be  employed  where  (a)  overall 
efficiency  gains  are  high  and  (b)  the  tasks  are  small  enough  that  response 
time  requirements  can  still  be  met. 

In  some  conditions,  applications  involving  large  files  may  cause 
even  small  tasks  to  stretch  the  system  in  terms  of  providing  required 
response time. Great care has then to be taken in evaluating the
feasibility of such systems for a particular machine configuration.

This brings up an important philosophy which results from the fact
that a given total system is only capable of handling job-task requests
of a certain size under normal loading. Hence some operations cannot
be allowed on-line. An example would be a request involving large
amounts of scanning or possibly large output. Note that on-line initiation
should be possible but quick response cannot be achieved. Several
alternatives again exist. Two are to initiate on-line, causing (a) off-line
action with output returned via normal line printer methods, etc., or
(b) scheduled on-line slow response action, with a message left for the
requester which he can obtain on-line (not suitable for large output tasks).
Backup tends to be neglected in this environment and operates as in the
off-line environment.

3.5 Security and Privacy

These  are  difficult  problems  which  we  will  not  cover  other  than  to 
make  a  few  brief  comments. 

Security  is  protection  from  error.  Some  software  oriented  protection 
can  be  provided  by  the  inclusion  of  read/write  locks  which  can  be  set  and 
unset  by  user  or  software  action  under  controlled  conditions  (see  4.2  for 
further  discussion) . 

Total security is not possible; reasonably effective security is
possible. In any event suitable backup must be kept according to the
environmental needs to enable re-initiation or recovery.

Privacy  is  protection  from  unauthorized  access  and  also  from  "mis-use" 
of  data  by  authorized  personnel. 

The  former  can  be  eased  by  protection  facilities  which  (a)  allow  only 
authorized  people  or  programs  into  the  system  (b)  allow  only  authorized 
people  or  programs  past  the  data-locks  at  gross  or  even  item  level. 

The following references should be studied: [9], [45], [46]. One
point is clear: once data is held in a system there can be no guarantee of
absolute privacy. The best that can be done is to ensure that the cost of
illegally obtaining data is prohibitive. This puts a good case for the
point of view that some special data should never be held in an
identifiable system.
CHAPTER 4

Some  Technical  Factors  in  Filing  Systems 

4.1 Implications of Dedicated versus Shared (Auxiliary) Storage

Dedicated  storage  is  where  one  task  (process)  can  arrange  to  have 
exclusive  access  to  a  storage  area  for  a  period  of  time  which  ranges 
from  a  group  of  specified  accesses  (in  a  program,  maybe  a  whole  program) 
to  permanent  dedication.  Shared  storage  is  the  opposite  of  this  where 
access  is  shared  by  several  tasks  (processes) .  In  this  case  all  access 
requests  are  queued  and  a  scheduling  algorithm  used  to  determine  order 
of  servicing  of  requests.  Examples  of  such  algorithms  are  FIFO  (first 
come  first  served)  and  shortest  distance  first.  In  both  cases,  in  a 
multiprogramming  system,  the  access  requests  of  different  tasks  (processes) 
will  be  intermixed  for  servicing.  The  main  implication  here  is  that  in 
dedicated  storage  one  knows  where  the  device  access  mechanism  is  relative 
to  the  storage  media  and  can  take  advantage  of  this.  Examples  of  this 
are  (a)  laying  sequentially  accessed  data,  cylinder  by  cylinder  on  a  disc 
thus  reducing  head  movement,  (b)  putting  the  directory  table  of  a  one 
level  table  access  mechanism  in  the  middle  of  the  data  set  on  a  disc 
(rather  than  at  one  end)  thus  reducing  average  swing  from  table  (always 
accessed)  to  data  records. 
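The gain from example (b) can be estimated with a short calculation. It rests on assumptions not stated in the text: the data set spans positions $0$ to $L$ on the disc, the position $X$ of the wanted record is uniformly distributed over that span, arm travel is proportional to distance, and no other task moves the arm between the table access and the record access.

```latex
\[
\text{table at one end:}\quad
\mathbb{E}\,|X - 0| = \frac{1}{L}\int_0^L x \, dx = \frac{L}{2},
\qquad
\text{table in the middle:}\quad
\mathbb{E}\left|X - \tfrac{L}{2}\right|
  = \frac{1}{L}\int_0^L \left|x - \tfrac{L}{2}\right| dx = \frac{L}{4}.
\]
```

Under these assumptions the central placement halves the average swing from table to data record.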

All these types of arrangement depend on not allowing another task
(process) to cause an unknown arm movement (or even interfere with
inter-track pickup timings), thus upsetting the basis for the arrangement.

On the other hand, in shared storage, assuming an even distribution of
accesses over the storage, one can obtain optimum access conditions over
all tasks by choosing algorithms such as shortest distance first. This is
done at the possible expense of some individual tasks.
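The two scheduling algorithms named earlier can be compared on a small queue of requests. The sketch below is illustrative only: the cylinder numbers, the starting arm position and the function names are invented, cost is measured simply as cylinders travelled, and all requests are assumed to be queued before servicing starts.

```python
def fifo(start, requests):
    """Total arm travel when requests are served in order of arrival."""
    position, total = start, 0
    for cylinder in requests:
        total += abs(cylinder - position)
        position = cylinder
    return total


def shortest_distance_first(start, requests):
    """Total arm travel and service order when the nearest request goes next."""
    pending = list(requests)
    position, total, order = start, 0, []
    while pending:
        nearest = min(pending, key=lambda c: abs(c - position))
        pending.remove(nearest)
        total += abs(nearest - position)
        position = nearest
        order.append(nearest)
    return total, order


queue = [95, 10, 60, 12, 90]
fifo_travel = fifo(50, queue)
sdf_travel, sdf_order = shortest_distance_first(50, queue)
```

For this queue the arm travels 306 cylinders under FIFO and 130 under shortest distance first, but the request for cylinder 10, which arrived second, is served last. That is the expense to an individual task mentioned in the text.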

In some cases individual degradation can be severe. Compare
track-by-track sequential access of possibly 1 msec (and certainly within
25 msec) with average random access of 87.5 msec (2314 disc pack). This
in itself can be bad enough for the non-trivial file user. However, it
could also lead to total system degradation if the system is I/O bound.
For example, if we have two or more sequential file processing tasks, all
I/O bound, the extra cost per individual I/O access cannot be absorbed by
other jobs using the CPU, and the I/O bottleneck is worsened. The assumed
random distribution of access (time and space) is just not correct under
this type of job load and so we have lost out all round.

This  suggests  that  scheduling  algorithms  should  be  designed  to  some 
extent  for  the  job  load  and  that  maybe  special  jobs  should  claim  dedication 
from  the  scheduler  over  certain  periods  (problems  here  in  response  for 
other  jobs).  For  the  large  storage  user  the  solution  may  be  to  ensure 
that  he  occupies  complete  storage  media  units  (disc  packs) . 
4.2 Uninterruptable Operations

This  is  a  problem  which  so  far  has  not  received  much,  if  any, 
attention.  We  state  it  as  follows.  The  presence  of  highly  complex 
files  has  resulted  in  file  facility  operations  which  involve  a  number 
of  internal  file  operations  for  their  execution.  Immediately  before 
and  after  such  a  facility  operation,  the  file  is  in  a  recognised  legal 
state  and  is  capable  of  operating  correctly  under  the  given  set  of 
facility  operations.  However,  during  the  execution  of  a  facility  oper¬ 
ation,  the  file  takes  on  a  series  of  temporary  transition  states. 

If  left  in  one  of  these  transition  states,  the  file  may  not  operate 
correctly  under  the  facility  operation  set.  Unfortunately  with  complex 
files  the  length  of  the  transition  periods  are  sufficiently  large  as  to 
seriously  increase  the  chance  of  accidental  interruption  during  one  of 
these  periods,  leading  to  a  faulty  file  state  and  troublesome  error 
detecting.  For  example  in  the  large  job-task  off-line  environment  the 
job  may  exceed  its  time  limit  during  one  of  these  periods.  It  is  poss¬ 
ible  for  errors  such  as  these  to  go  undetected  for  some  time,  especially 
if  the  file  will  operate  correctly  under  most  usage  conditions.  This 
same  point  arises  if  we  wish  to  time-share  file  usage.  Extreme  care  is 
needed  to  ensure  that  such  sharing  is  carried  out  in  a  logically  consis¬ 
tent  way.  This  particular  problem  can  be  eased  by  suitable  protection 
locks  on  files  (and  parts  of  files)  but  still  it  is  no  use  cutting  off  a 
task  at  the  end  of  a  time  slice  if  all  (or  most  or  even  a  few)  tasks  will 
themselves  be  held  up  by  the  locks  (the  author  can  imagine  queuing  horror 
conditions here). Hence, in file handling systems there is a strong
case for time-sharing based on the completion of specified small tasks
rather than on fixed time slices. We thus have the concept of
uninterruptable operations, which is fairly analogous to the indivisible
operations in operating systems and hardware. Depending on many factors,
these uninterruptable operations may correspond to the file facility
operations or to specially chosen parts of these facility operations.
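
As a minimal sketch of the idea, assuming a toy in-memory file and a single protection lock on the whole file (the class AccountFile and its methods are hypothetical and are not taken from any system discussed in these notes), a facility operation can be wrapped so that no other task observes a transition state:

```python
import threading
from contextlib import contextmanager


class AccountFile:
    """Toy in-memory 'file' whose facility operations run as uninterruptable units."""

    def __init__(self):
        self._lock = threading.Lock()  # protection lock on the whole file
        self.deposits = []             # internal state, part 1
        self.balance = 0               # internal state, part 2

    @contextmanager
    def _uninterruptable(self):
        # Hold the lock for the whole facility operation so that other
        # tasks only ever see the legal states before and after it.
        with self._lock:
            yield

    def add_deposit(self, amount):
        """Facility operation made of two internal file operations."""
        with self._uninterruptable():
            self.deposits.append(amount)       # transition state begins
            self.balance = sum(self.deposits)  # legal state restored

    def is_legal(self):
        """A state is legal when the stored balance matches the deposits."""
        with self._lock:
            return self.balance == sum(self.deposits)
```

A lock of this kind only hides the transition states from other tasks. It does not stop a task from being cut off while it holds the lock, which is the queuing problem noted above.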


4.3 The Role of Virtual Memory in the Handling of Application Files

The  usefulness  and  applicability  of  the  virtual  memory  concept  to 
problems  normally  involving  the  handling  of  explicit  application  files 
is an important subject for research. Reference [47] reports the
present author's views on this subject. The paper attempts to set out
the  facts  regarding  the  performance  of  virtual  memory  with  regard  to 
data  handling  and  to  suggest  an  approach  by  which  virtual  memory  may  be 
used  satisfactorily  by  an  explicit  file  handling  subsystem. 




CHAPTER  5 


Consideration  of  an  Existing  Filing  System 


5.1  Introduction 


The  aim  of  this  chapter  is  to  consider  an  existing  filing  system 
with  regard  to  what  facilities  are  provided  to  the  user  and  what  can 
be  expected  in  terms  of  the  performance  of  the  system.  All  information 
about the system has been obtained from published documents. The contents
of this chapter represent the author's interpretation of the
inspected publications and the author apologizes for any errors that may
exist here.


5.2 The TDMS Data Management System

5.2.1  Background 

This  system  was  developed  by  SDC  and  was  first  demonstrated 
in  1968.  It  runs  on  the  ADEPT  time  sharing  operating  system  which 
runs  on  IBM  360  machines  from  the  50  upwards.  It  is  one  of  the 
few  data  management  systems  that  utilises  a  file  structure  which 
takes good advantage of the characteristics of direct access auxiliary
storage media [8]. It is also orientated towards "non-programmer"
users  interacting  on-line  with  the  system  via  typewriter  terminals. 

It  was  claimed  in  1968  that  a  response  time  of  5  seconds  was  normal 
for  a  load  of  25  users. 

5.2.2  The  Facility  Modules 


The  user  facilities  are  divided  into  six  distinct  modules  plus 
various control facilities. The modules are considered independently
in later sections of this chapter. Here we summarize their
purposes.

(1)  Define  -  enables  a  new  data  base  to  be  defined  by  the  speci¬ 
fication  of  the  data  structures  for  the  data  base.  It  is 
also possible to modify the description of an existing data
base.

(2)  Load  (or  Generate)  -  creates  a  data  base  given  its  description 
and  the  data  to  be  input. 

(3)  Query  -  enables  users  to  query  a  data  base  in  order  to  obtain 
information  from  it. 

(4)  Update  -  enables  a  user  to  update  on-line  the  contents  of  a 
data  base.  Updating  here  corresponds  to  deleting,  adding 
and modifying the values of selected elements in the data
base.

(5) Maintain - an off-line facility, initiated on-line, to perform
substantial maintenance operations on a data base.
These  operations  include  the  merging  of  data  bases  and  the 
performance  of  update  operations  which  are  too  time  con¬ 
suming  for  on-line  update. 

(6)  Compose  (and  Produce)  -  This  is  a  two  part  facility  for  the 
generation  of  off-line  reports  from  a  data  base.  The  two 
parts  are  on-line  report  specification  and  off-line  report 
production  (initiated  on-line) . 

5.2.3  The  Data  Base  Structures  and  the  Define  and  Load  Modules 


The  basic  major  unit  in  a  data  base  is  termed  a  logical  entry 
and  corresponds  to  some  entity  such  as  one  employee  of  a  firm. 

A  data  base  can  only  have  one  logical  entry  structure  but  it  is 
possible  to  arrange  for  several  types  of  entry  to  occur  within  the 
single  structure  as  we  will  see  later. 

The components of a logical entry are either elements or
repeating groups. An element is a unit which corresponds to a
basic data value and cannot consist of other components. A
"sub-element" facility exists which enables a defined portion of an
element to have a name and value. A repeating group is a component
consisting of a group of subordinate components. Data is assigned
to elements but not to repeating groups. The only permissible data
types are: NAME (alphanumeric constants), NUMBER (numeric constants),
DATE (dates). All components are given unique numbers
which are user specified.

The  terms  logical  entry,  repeating  group  and  element  corres¬ 
pond  to  the  terms  record,  group  (with  multiplicity),  and  field,  of 
section  1.5  of  these  notes.  We  will  use  the  account  file  (slightly 
extended)  of  that  section  to  illustrate  the  facilities  provided  by 
TDMS.  The  DEFINE  module  enables  new  data  bases  to  be  defined  and 
existing  data  base  descriptions  to  be  modified. 

We  will  look  now  at  the  definition  of  new  data  bases.  Each 
data base is associated with a file. The file must be identified
according  to  the  needs  of  the  operating  system.  For  the  ADEPT 
operating  system  this  requires  a  file  name  and,  optionally,  inform¬ 
ation  on  the  storage  volume,  status  of  the  file,  etc. 

Definition  is  triggered  by  the  command  DEFINE. 

The  user  must  then  specify  the  data  base  name  and  the  entry 
terminator  which  signifies  the  end  of  each  entry  occurrence  in  the 
input  data.  A  sample  entry  description  follows: 

1  ACCOUNT  NUMBER  (NAME) 

2  HOLDERNAME  (RG) 

3  SURNAME  (NAME  IN  HOLDERNAME) 
4 INITIALS (NAME IN 2)

5  DEPOSITS  (RG) 

6  D-DATE  (DATE  IN  5) 

7  D-AMOUNT  (NUMBER  IN  5) 

8  WITHDRAWALS  (RG) 

9  W-DATE  (DATE  IN  8) 

10  W-AMOUNT  (NUMBER  IN  8) 

11  BALANCE  (NUMBER) 

This  corresponds  to  our  account  file  of  section  1.5. 
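
To make the shape of this description concrete, the following sketch holds the same eleven components as plain records. The names Component, ACCOUNT and level are our own and purely illustrative; nothing here is claimed to reflect how TDMS stores a description internally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Component:
    number: int                   # user-specified component number
    name: str
    kind: str                     # "NAME", "NUMBER", "DATE" or "RG" (repeating group)
    parent: Optional[int] = None  # number of the containing repeating group, if any


ACCOUNT = [
    Component(1, "ACCOUNT NUMBER", "NAME"),
    Component(2, "HOLDERNAME", "RG"),
    Component(3, "SURNAME", "NAME", parent=2),
    Component(4, "INITIALS", "NAME", parent=2),
    Component(5, "DEPOSITS", "RG"),
    Component(6, "D-DATE", "DATE", parent=5),
    Component(7, "D-AMOUNT", "NUMBER", parent=5),
    Component(8, "WITHDRAWALS", "RG"),
    Component(9, "W-DATE", "DATE", parent=8),
    Component(10, "W-AMOUNT", "NUMBER", parent=8),
    Component(11, "BALANCE", "NUMBER"),
]


def level(components, number):
    """Depth of a component in the tree representation of the entry structure."""
    by_number = {c.number: c for c in components}
    depth = 0
    current = by_number[number]
    while current.parent is not None:
        depth += 1
        current = by_number[current.parent]
    return depth
```

With this representation level(ACCOUNT, 11) is 0 and level(ACCOUNT, 9) is 1, since W-DATE lies inside the WITHDRAWALS repeating group.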

The  user  may  also  place  optional  restrictions  on  the  format 
and  values  corresponding  to  individual  elements. 

The  VALUE  clause  specifies  sets  of  allowed  values  for  elements. 
These  may  be  ranges  of  values  (AGE  elements  will  be  positive  and 
less than, say, 120) or a number of individual values. This latter
case is useful for specifying a number of types of entry within one
logical entry structure (and hence within one data base).

e.g., 1 ENTRY TYPE (NAME) VALUES ARE SUMMARY, DEPOSITS, WITHDRAWALS

where  different  sets  of  elements  are  associated  with  the  different 
data  values.  This  is  essentially  a  user  aid,  in  that  data  for  an 
entry  occurrence  can  consist  of  any  set  of  elements  which  the  user 
wishes  to  supply  and  the  only  advantage  of  the  type  element  is 
that  all  entry  data  occurrences  commence  with  the  appropriate  type 
value.  Note  that  entry  occurrences  are  self-defining  in  that  each 
occurrence specifies which elements have existing values for that
occurrence.

The  FORMAT  clause  may  be  used  to  specify  the  legal  format  of 
input  data.  This  is  an  error  check  facility.  The  UNIQUE  clause 
may  be  used  to  specify  that  no  two  occurrences  of  an  element  can 
have  the  same  data  value. 
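
The effect of the VALUE and UNIQUE restrictions can be sketched as a simple check over the values supplied for one element. The function check_element and its parameter names are hypothetical; the sketch assumes an open range for the AGE example and says nothing about how TDMS itself performs the checks.

```python
def check_element(values, allowed=None, value_range=None, unique=False):
    """Return (position, message) pairs for values that break a restriction."""
    errors = []
    seen = set()
    for pos, value in enumerate(values):
        # VALUE clause, individual values
        if allowed is not None and value not in allowed:
            errors.append((pos, "not an allowed value"))
        # VALUE clause, range of values (both bounds exclusive here)
        if value_range is not None:
            low, high = value_range
            if not (low < value < high):
                errors.append((pos, "outside allowed range"))
        # UNIQUE clause
        if unique:
            if value in seen:
                errors.append((pos, "duplicate value"))
            seen.add(value)
    return errors
```

For example, check_element([30, 130], value_range=(0, 120)) reports the second value only.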

In  the  redefinition  facility  the  operations  of  adding,  deleting 
and  replacing  (utilizing  old  component  number)  components  are  pro¬ 
vided.  We  will  not  discuss  this  facility  any  further. 

The  Load  (or  Generate)  module  puts  entry  occurrence  data  into 
a  data  base  given  the  data  base  description.  The  correspondence 
between  the  data  and  the  description  is  obtained  by  having  the  data 
presented  as  component  number  -  data  value  pairs  plus  the  inclusion 
of  appropriate  repeating  group  component  numbers.  An  example  of 
input  data  for  our  account  data  base  is 

1) 106721 2) 3) SMITH 4) JH

5)  6)  07/08/69  7)  100.62 

5)  6)  09/08/69  7)  43.00 

11)  143.62 

END 
1) 182641 2) 3) BROWN 4) PR

5)  6)  08/08/69  7)  150.32 

8)  9)  09/08/69  10)  100.00 

11)  50.32 

END 

Note  that  this  module  is  for  initial  data  base  creation  and  not 
for  updating.  Note  also  that  input  data  may  be  presented  from 
an  existing  file  (of  the  correct  format)  or  interactively  from 
an  on-line  terminal. 
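
The input convention above can be illustrated by a small reader that turns the text into component number - value pairs, with a repeating group number recorded as a pair whose value is None. This is only a sketch under our own assumptions: values contain no spaces or closing brackets, the terminator is END, and the name parse_entries is hypothetical.

```python
import re


def parse_entries(text, terminator="END"):
    """Split load input into entries of (component number, value) pairs."""
    entries, current = [], []
    pending = None  # component number still waiting for its value
    for token in re.findall(r"\d+\)|[^\s)]+", text):
        if token == terminator:
            if pending is not None:
                current.append((pending, None))
                pending = None
            entries.append(current)
            current = []
        elif token.endswith(")"):
            if pending is not None:
                # Two numbers in a row: the first was a repeating group.
                current.append((pending, None))
            pending = int(token[:-1])
        else:
            if pending is None:
                raise ValueError("value without a component number: " + token)
            current.append((pending, token))
            pending = None
    return entries
```

Applied to the first sample entry, the pair (5, None) appears twice, once for each occurrence of the DEPOSITS repeating group.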

5.2.4  The  Query  Module 

This  is  an  on-line  facility  which  is  aimed  at  the  non-programmer 
user.  We  will  concentrate  on  the  retrieval  statements.  Other 
statements  exist  to  handle  output  format  on  the  terminal,  display 
data  base  descriptions  and  other  information,  and  perform  various 
control  functions. 

The  SHOW  command  retrieves  various  values  of  a  specified  ele¬ 
ment.  The  options  available  include  the  retrieval  of  the  first 
and  last  values,  values  within  ranges  and  a  number  of  values  from 
a  specified  base  value. 

The  PRINT  command  retrieves  and  prints  values  of  selected 
elements,  expressions  involving  elements,  and  other  system  oriented 
quantities  with  which  we  will  not  concern  ourselves.  Also  quali¬ 
fying  clauses  can  be  used  to  select  occurrences  from  the  data  base. 
The  main  clause  to  note  is  the  WHERE  clause.  An  example  of  a  quali¬ 
fied  PRINT  statement  is 

PRINT  ACCOUNT  NUMBER,  BALANCE  WHERE  BALANCE  LQ  5 

which  will  print  out  the  Account  Number  and  Balance  values  for  all 
entries  where  the  balance  is  less  than  or  equal  to  5  units.  The 
relational  and  logical  operators  available  are  similar  to  those 
provided  in  languages  such  as  FORTRAN  IV  and  ALGOL,  as  are  the 
arithmetic  expression  facilities. 
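
In present-day terms the qualified PRINT statement is a selection followed by a projection. The sketch below shows this on a few in-memory entries; the third entry and the function name print_where are invented for illustration, and LQ is rendered as the <= operator.

```python
entries = [
    {"ACCOUNT NUMBER": "106721", "BALANCE": 143.62},
    {"ACCOUNT NUMBER": "182641", "BALANCE": 50.32},
    {"ACCOUNT NUMBER": "190003", "BALANCE": 4.10},  # hypothetical extra entry
]


def print_where(entries, elements, predicate):
    """Return the named element values for every entry satisfying the predicate."""
    return [tuple(e[name] for name in elements) for e in entries if predicate(e)]


# PRINT ACCOUNT NUMBER, BALANCE WHERE BALANCE LQ 5
rows = print_where(entries, ["ACCOUNT NUMBER", "BALANCE"],
                   lambda e: e["BALANCE"] <= 5)
```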

Note  that  certain  statistical  functions  such  as  average,  count, 
etc.,  are  available  and  are  applied  to  the  specified  data  base  sub¬ 
set. 

5.2.5  The  Update  Module 

This  module  provides  on-line  update  facilities  enabling  exist¬ 
ing  data  values  to  be  changed,  new  data  values  to  be  added,  and  old 
data  values  to  be  deleted.  For  example  entire  logical  entries  can 
be  inserted.  A  possible  task  for  our  account  file  could  be  the 
addition of a new deposit occurrence to an account record occurrence,
followed by a recalculation of the account balance, followed
by  a  change  to  the  stored  account  balance. 

The  WHERE  clause  and  the  qualifying  clauses  are  still  avail¬ 
able  for  the  UPDATE  commands,  thus  allowing  updates  to  be  performed 
on  subsets  of  the  data  base. 

The  update  statements  provided  are  SET,  ADD  and  REMOVE.  Note 
that  the  system  displays  the  effects  of  statements  and  awaits  veri¬ 
fication  prior  to  actual  execution.  Thus  it  is  possible  to  cancel 
statements  if  a  specification  error  has  been  made.  Examples  of  the 
three statement types are

SET  BALANCE  =  0  WHERE  ACCOUNT  NUMBER  EQ  102347 

ADD DEPOSITS, D-DATE=10/08/69, D-AMOUNT=5.00
    WHERE ACCOUNT NUMBER EQ 104927

REMOVE  ENTRY  WHERE  ACCOUNT  NUMBER  EQ  105621 

Note  that  the  word  ENTRY  must  be  present  in  the  statements  ADD 
and  REMOVE  when  concerning  complete  entries. 

5.2.6  The  Maintain  Module 

This  module  consists  of  two  independent  utilities  together 
with  six  more  which  have  been  planned  but  not  implemented.  We 
will only consider the first two, which are Batch Update, and Copy and
Cleanup.

The  Batch  Update  facility  has  similar  statements  to  the  Update 
Module  but  is  designed  for  updates  which  are  too  time  consuming  for 
on-line  execution.  Note  that  it  is  possible  to  store  the  batch 
update  statements  and  edit  them  on-line. 

The  Copy  and  Cleanup  facility  simply  copies  the  data  base  from 
one  storage  area  to  another  allowing  a  certain  amount  of  distributed 
storage space to ease the problem of handling updates. Compare with
the spare space policy in the update cost example, case 4 of section
2.2.1.

5.2.7  The  Compose  Module 

This consists of two parts: a facility to compose report descriptions
on-line and a facility to produce reports off-line by on-line
initiation.

The  QUALIFY  and  SORT  statements  apply  to  an  entire  report. 

The  QUALIFY  statement  is  used  to  select  a  subset  of  the  data  base 
for  the  report.  The  WHERE  clause  is  used  in  the  usual  way.  The 
statement  may  specify  whole  entries  or  components.  In  the  former 
case the selected subset will consist of a number of complete
entries.  In  the  latter  case  certain  sections  of  entries, 
those  sections  associated  with  the  non-qualifying  element 
values,  are  excluded.  Whole  entries  are,  of  course,  excluded 
if  no  suitable  values  exist  for  the  qualified  element. 

The  SORT  statement  enables  a  number  of  specified  elements 
to  be  sorted  on  the  basis  of  the  order  given  in  the  statement. 
For  example 

SORT  BALANCE  SURNAME 

sorts  first  on  BALANCE  and  then  within  each  balance  value  sorts 
on  SURNAME  value. 
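
The ordering produced by a SORT statement with several elements is that of a composite key. A minimal sketch, assuming ascending order on every element and using invented sample rows (sort_entries is a hypothetical name):

```python
rows = [
    {"SURNAME": "SMITH", "BALANCE": 143.62},
    {"SURNAME": "BROWN", "BALANCE": 50.32},
    {"SURNAME": "ADAMS", "BALANCE": 143.62},  # hypothetical extra entry
]


def sort_entries(rows, *elements):
    """Sort on the first element, then on the next within equal values, and so on."""
    return sorted(rows, key=lambda r: tuple(r[e] for e in elements))


# SORT BALANCE SURNAME
ordered = sort_entries(rows, "BALANCE", "SURNAME")
```

Here BROWN comes first on balance, and ADAMS precedes SMITH because the two share a balance value.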

The  lack  of  a  SORT  statement  implies  the  retention  of  the 
natural  ordering  of  the  data  base. 

Facilities  exist  to  compose  titles  and  headings  and  to  give 
user  control  of  the  report  format  which  is  normally  dealt  with 
automatically. 

The  CONTENT  statement  is  used  to  specify  which  elements  are 
to  be  printed  in  the  report.  A  WHERE  clause  may  be  used  to 
qualify  the  CONTENT  statement  in  the  usual  way.  An  example  of 
a  CONTENT  statement  is 

CONTENT  IS  ACCOUNT  NUMBER,  BALANCE  WHERE  BALANCE  LQ  5.00 

The  report  will  then  print  out  the  two  element  values  for  each 
entry  with  a  BALANCE  less  than  or  equal  to  5.00. 

It  is  also  possible  to  leave  variables  in  the  report  des¬ 
cription  by  using  the  special  symbol  K.  For  example  we  replace 
the  5.00  above  by  K.  Then  at  report  production  initiation  the 
system  asks  for  a  value  as  follows 

BALANCE  =  :  5.00 

where  the  user  reply  is  shown  underlined. 

Finally  it  should  be  noted  that  the  report  description  can 
be  edited  on-line  as  can  all  stored  descriptions. 

5.2.8  The  File  Structure 


The  File  Structure  of  TDMS  is  basically  an  inverted  file  on 
all elements with occurrences being self-defined in the sense that
they specify which elements have existing values.

The total structure can be separated into four parts: a data
structure directory, an inverted list structure, a set occurrence
directory and the set occurrences.

(a)  The  data  structure  directory  contains  the  complete  logical  entry 
description  for  the  data  base.  The  section  we  are  concerned 
with  is  the  "CDEFINA"  table  which  is  a  table  of  component  number 
(and component name) against a pointer into the inverted list
structure.

(b)  The  "CDEFINA"  table  references  the  "CVALUES"  table  which  is  a 
table  containing  all  existing  values  for  each  element.*  The 
values  for  each  element  are  kept  in  order  and  referenced  through 
an  index  which  specifies  the  highest  value  in  each  storage 
block.  With  each  value  in  the  "CVALUES"  table  is  a  pointer  to 
the  "CENTS"  table.  The  "CENTS"  table  consists  of  the  lists  of 
set  numbers  with  one  list  for  each  element  name-value  pair.  A 
set  is  an  occurrence  of  a  repeating  group  or  a  complete  logical 
entry.  Sets  are  numbered  as  they  arrive  in  the  system. 

(c) The set occurrence directory has two tasks. It specifies the
set occurrence structure of the data base and also provides
pointers to the actual set occurrences. Each position in the
"CFIND" table (the directory) corresponds to a set occurrence
number. A position consists of (1) a pointer to an occurrence,
(2) a repeating group component number, (3) a down-pointer which
references the position in the "CFIND" table of the immediately
containing set, and (4) an up-pointer which references the position
in the "CFIND" table of the next set (higher set number) at the
same data structural level. Level here corresponds to the depth
of the component corresponding to the set in the tree representation
of the logical entry structure. For example in our
accounts file, components 1, 2, 5, 8, 11 are all at level 0 and
components 3, 4, 6, 7, 9, 10 are all at level 1. Since repeating
groups can be nested they may appear at many levels. Note
that a given set occurrence may contain a number of occurrences
of sets at higher levels (hence the down pointer). Note that
the down pointer at the logical entry level is used to chain
logical entry occurrences together.

(d) The final part of the structure is the set occurrence section
(the "CDATA" table). Each occurrence is composed of component
number - (value or pointer) pairs. The component numbers make the
occurrences self-defining. The value option is used if the value
storage length plus the component number length is less than or
equal to a given length. If the value storage length is too long, a
pointer is given to the "CNAME" table.


* Not quite true; see later.
With this system of file handling, long data values only
appear  once  in  the  system. 

The  "CNAME"  table  contains  all  data  occurrence  values  of 
more  than  the  appropriate  storage  length  in  the  order  in  which 
they  initially  arrive  in  the  system.  This  table  is  also  part  of 
the inverted list structure as follows. The "CVALUES" table holds
in full only those values whose length is less than or equal to the
appropriate storage length. Each longer value is represented by a
6-bit collating code formed from the first five characters of the
value, plus a pointer to the "CNAME" table, plus an indication of
whether the adjacent values have the same first five characters. The
following figure shows the
logical  setup.  More  detailed  diagrams  with  examples  are  given  in 
the  various  TDMS  publications. 


A  final  point  to  note  is  that  spare  space  is  left  at  appropriate 
positions  in  all  the  tables  so  that  additions  can  occur  with  a 
minimum  of  data  movement  and  auxiliary  storage  re-arrangement. 
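
The way the four parts cooperate in a lookup can be sketched with ordinary in-memory tables. The dictionaries below are our own simplified stand-ins for the set occurrence directory, the set occurrences and the inverted list structure; set numbers, field names and function names are hypothetical, and the treatment of long values through the "CNAME" table is left out.

```python
from bisect import bisect_left

# Stand-in for the set occurrence directory: set number -> containing set.
cfind = {
    0: {"group": None, "parent": None},  # logical entry for account 106721
    1: {"group": 2, "parent": 0},        # HOLDERNAME occurrence
    2: {"group": 5, "parent": 0},        # first DEPOSITS occurrence
    3: {"group": 5, "parent": 0},        # second DEPOSITS occurrence
}

# Stand-in for the set occurrences: set number -> {component number: value}.
cdata = {
    0: {1: "106721", 11: 143.62},
    1: {3: "SMITH", 4: "JH"},
    2: {6: "07/08/69", 7: 100.62},
    3: {6: "09/08/69", 7: 43.00},
}


def build_inverted(cdata):
    """For each component, an ordered list of (value, [set numbers])."""
    inverted = {}
    for set_no, pairs in cdata.items():
        for comp, value in pairs.items():
            inverted.setdefault(comp, {}).setdefault(value, []).append(set_no)
    return {comp: sorted(vals.items()) for comp, vals in inverted.items()}


def lookup(inverted, comp, value):
    """Set numbers holding the given element-value pair (binary search on values)."""
    values = inverted.get(comp, [])
    keys = [v for v, _ in values]
    i = bisect_left(keys, value)
    if i < len(keys) and keys[i] == value:
        return values[i][1]
    return []


def containing_entry(cfind, set_no):
    """Follow containing-set pointers up to the logical entry."""
    while cfind[set_no]["parent"] is not None:
        set_no = cfind[set_no]["parent"]
    return set_no


inverted = build_inverted(cdata)
```

Looking up SURNAME = SMITH gives set 1, and following the containing-set pointer from set 1 leads to logical entry 0, from which the account number can be read.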

5.2.9  An  Evaluation  of  the  File  Structure 

From  our  point  of  view  we  are  interested  in  the  number  of 
auxiliary  storage  accesses  required  to  perform  operations  such  as 
retrieval  and  updating.  These  are  clearly  dependent  on  many  fac¬ 
tors  concerning  the  data  stored  and  the  operations  requested. 

What  we  will  do  here  is  indicate  the  number  of  auxiliary  accesses 
to  be  expected  under  certain  conditions  and  indicate  how  variations 
may  occur.  Let  us  first  consider  the  retrieval  operation.  We 
assume  that  all  parts  of  the  file  structure  are  in  auxiliary  storage 
and must be brought into core store. Clearly, in practice, some
blocks will be resident in core during a period of use of a data
base.


Generally  speaking  the  "CDEFINA"  table  will  be  small  and  we 
may  allow  1  auxiliary  access  to  retrieve  it.  Two  auxiliary 
accesses  will  be  required  to  retrieve  the  "CVALUES"  table  assuming 
a  reasonable  number  of  values  per  element;  one  access  for  an  index 
and  one  access  for  a  block  of  values.  The  set  of  values  per 
element would have to be very large to incur a greater penalty than
this.


We thus, so far, have a total of 3 auxiliary accesses to
identify the position of the set occurrence list for a given value
of a given element. This should be multiplied by the number of
such operations required in a retrieval, say n_v. List correlation
is now required according to the retrieval logic. If we assume
that the list lengths are all short, then 1 auxiliary access can
be allowed for each list. Note that if these lists are long, then
list correlation is very time consuming and in large files, such as
bibliographic files, we may expect them to have this property. As
a result of our list correlations, we have selected a number of set
occurrences, say n_o. Each occurrence implies at least one access
to the "CFIND" table. If the retrieval logic requires the printing
of lower level quantities, appropriate to the selected set occurrences,
then the "CFIND" structure must be utilised to select the
appropriate containing sets. This may require further accesses
depending on the table layout. We will ignore this type of retrieval
here. Assuming that data is required from the selected
set occurrences, we require 1 access to the "CDATA" table for each
set. We may also require accesses to the "CNAME" table to pick
up long data values. Let us assume we need on average n_l auxiliary
accesses per set occurrence for this task.

Our total retrieval cost is thus

    4n_v + (2 + n_l)n_o auxiliary accesses.

The minimum cost value for retrieval under our conditions is thus
6 auxiliary accesses, this occurring when n_v = n_o = 1 and n_l = 0.
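
The retrieval count can be tabulated directly. In the sketch below n_v is the number of element-value pairs used in the selection, n_o the number of selected set occurrences, and n_l the average number of "CNAME" accesses per set occurrence; the function name and the parameter spellings are our own, and the short-list assumption made above is built in.

```python
def retrieval_cost(n_v, n_o, n_l=0.0):
    """Auxiliary accesses for a retrieval under the short-list assumptions."""
    locate_lists = 3 * n_v         # CDEFINA (1) + CVALUES index and block (2)
    read_lists = 1 * n_v           # one access per short occurrence list
    per_set = (1 + 1 + n_l) * n_o  # CFIND + CDATA + CNAME accesses per set
    return locate_lists + read_lists + per_set
```

For example, a request on two element-value pairs that selects ten set occurrences, each needing on average half a "CNAME" access, costs 8 + 25 = 33 accesses.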

The  process  of  storing  long  data  values  once  in  the  "CNAME" 
table  and  using  pointers,  as  opposed  to  storing  in  all  relevant 
occurrences,  is  an  important  one.  It  can  cause  a  considerable 
worsening  of  access  times  if  many  values  must  be  retrieved  but 
may  save  a  vast  amount  of  space.  If  retrieval  requests  are  very 
selective  and  require  only  one  or  two  elements  to  be  printed,  then 
the  penalty  is  worth  suffering  for  the  space  gain,  unless  the  res¬ 
ponse  requirement  of  the  system  is  threatened  by  the  total  online 
load. 

The  cost  of  update  depends  on  the  type  of  update  considered. 
Comparing with retrieval costs, we may expect one extra auxiliary
access  for  each  change  to  the  tables,  as  compared  with  table  re¬ 
trieval.  This  is  a  reasonable  assumption  due  to  the  spare  space 
available  in  all  the  tables  (see  section  2.2.1). 

An  exception  is  the  "CNAME"  table  since  a  change  to  the 
"CDATA"  pointers  to  this  implies  finding  the  new  pointer  values, 
which  implies  verifying  whether  or  not  the  new  element  value 
already  exists  in  the  table.  Since  the  values  in  this  table  are 
in  order  of  arrival  we  must  either  scan  the  complete  table  or  we 
require  an  ordered  directory  to  it  in  which  case  several  extra 
auxiliary  accesses  may  be  required.  We  will  not  go  into  this 
further.  We  assume  a  modification  to  a  single  value  in  the  sel¬ 
ected  set  occurrences  of  our  retrieval  example  with  the  qualifi¬ 
cation  that  the  value  is  short  enough  to  be  stored  in  the  "CDATA" 
table  and  corresponds  to  one  of  the  element-value  pairs  used  in 
the set occurrence selection. Our set occurrence selection and
retrieval cost is thus

    4n_v + 2n_o.

Updating the n_o occurrences costs n_o extra auxiliary accesses,
assuming no blocking gain. All that remains to be done is to
delete the numbers of the selected set occurrences from the element-old
value pair occurrence list and place them on the element-new
value pair occurrence list. The deletion takes one auxiliary
access assuming the old list is retained in core after its retrieval
for the selection process. Adding the set occurrence numbers to
the element-new value occurrence list takes 3 auxiliary accesses,
based on our previous assumptions plus the further one that the
element-new value pair is not involved in the set occurrence selection
logic.


Our total update cost is thus

    4n_v + 3n_o + 4,

which is of the same order of magnitude as our retrieval cost.
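
The same bookkeeping gives the update count term by term. As before, n_v is the number of element-value pairs used in the selection and n_o the number of selected set occurrences; the function name is hypothetical and the sketch covers only the single short-value modification considered here.

```python
def update_cost(n_v, n_o):
    """Auxiliary accesses for modifying one short value in each selected set."""
    select_and_retrieve = 4 * n_v + 2 * n_o  # selection plus CFIND and CDATA reads
    rewrite_sets = n_o                       # one extra access per updated occurrence
    delete_from_old_list = 1                 # old list is still in core
    add_to_new_list = 3                      # locate the new value's list and rewrite it
    return select_and_retrieve + rewrite_sets + delete_from_old_list + add_to_new_list
```

With one selecting pair and one selected occurrence the update costs 11 accesses, against 6 for the corresponding retrieval.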


5.2.10 Concluding Remarks on TDMS

In  considering  any  system  it  is  important  to  look  at  it  as  a 
whole  as  well  as  looking  at  its  constituents.  In  TDMS  two  clear 
points  of  philosophy  show  through  the  detailed  specifications. 

(1)  The  convenience  of  on-line  facilities  is  important  enough 
that  the  provision  of  such  should  be  maximised  within  the 
constraints  of  the  usual  economic  and  technical  consider¬ 
ations.  The  system  allows  all  operations  to  be  initiated 
on-line  even  if  the  operations  themselves  are  too  time- 
consuming  to  be  executed  on-line.  In  the  updating 
facilities the same operations are performed for both on-line
and off-line modules. One weakness of the system
is  that  there  is  no  control  which  prohibits  the  on-line 
user  from  carrying  out  large  tasks  that  should  be  executed 
off-line.  Thus  mis-use  of  the  system  can  lead  to  on-line 
response  delays.  It  is  clearly  not  easy  to  introduce 
such  control.  If  it  were  possible  to  estimate  request 
requirements  from  the  request  details,  then  the  control 
could  occur  at  request  initiation.  Another  possibility 
is  to  give  low  priority  to  any  request  that  has  exceeded 
a  defined  amount  of  resource  cost. 

(2)  The  facilities  are  aimed  at  the  non-programming  user.  It 
is  hard  to  tell,  without  actually  using  a  system,  whether  it 
is  successful  in  this  aim.  However,  from  the  facility  des¬ 
criptions,  TDMS  would  seem  to  be  fairly  successful.  There 
are  a  small  number  of  statements  of  simple  form.  Any  com¬ 
plexity  that  exists  is  due  to  the  structure  of  the  user  data 
and  this  is  something  that  all  users  should  know  about. 

The file structure is well designed, especially as regards
the  self-defining  and  internal  data  structure  description  facilities. 
The  use  of  an  inverted  list  for  each  element  can  be  inappropriate  in 
applications  where  such  lists  become  long  and  expensive  list  re¬ 
trievals  and  correlations  are  necessary. 

This description of TDMS is based on references [55] to [59].


CHAPTER 6


The Creation of Application
Computer  Systems  in  a  Project  Framework 


6.1  Introduction 


File  handling  is  an  important  factor  in  many  application  computer 
systems  and  file  specialists  invariably  become  involved  in  the  manage¬ 
ment  of  projects  to  create  such  systems.  Project  management  is,  there¬ 
fore,  a  necessary  subject  in  the  education  of  file  specialists.  For 
this  reason  we  include  the  following  chapter  in  these  notes. 

The  aim  of  this  chapter  is  to  present  a  project  framework  which 
illustrates the various task-stages involved in creating application
computer  systems.  In  this  way  some  of  the  managerial  and  technical 
problems  of  project  organisation  are  identified  and  some  suggestions 
are  made  as  to  possible  solutions.  In  some  cases  prevention  is  strongly 
advised.  The  project  framework  itself  is  to  some  extent  arbitrary  - 
other  possibilities  equally  good  may  and  probably  do  exist.  To  criti¬ 
cise  this  chapter  on  these  grounds  is  to  miss  the  main  intention  of  the 
work.  Elaborating,  what  is  presented  here  represents  a  form  (or  model 
if  you  like)  of  project  which  appears  to  have  some  desirable  features. 
However  the  most  important  necessity  of  any  project  is  the  quality  of 
the  personnel  and  their  characteristics  (technical  and  emotional)  -  not 
the  form.  This  chapter  is  therefore  included  in  these  notes  to  attempt 
to  present  some  of  the  pitfalls  that  have  been  recognised  in  the  past 
in  the  hope  that  they  may  in  some  cases  be  avoided  in  the  future. 


6.2 The Stages in the Creation of Application Computer Systems

We delimit eight task-stages in the period from when the idea of
setting up a project has been formalised through to a running system.
These stages are: project initiation, application requirements, design,
program, test, evaluation, initial operation and maintenance. These
task-stages are recursive in that they apply to all subsystems and
smaller units in a complete system, although in varying degrees, dependent
on the unit characteristics. In considering the complete system
there  is  no  intention  that  these  task-stages  are  executed  sequentially. 
Considerable  overlap  and  feedback  is  essential.  It  is  possible  for 
example  for  several  units  of  a  subsystem  to  be  designed,  programmed, 
tested  and  evaluated  before  other  subsystems  are  even  designed,  apart 
from  their  proposed  interfaces.  However  the  task-stages  commence  and 
end  in  the  order  of  presentation  for  each  unit  which  is  involved.  Note 
that, in the example above, various feedback loops in the program-to-evaluate
process may be involved on a subsystem before a feedback to the
design stage triggers further design.


This example is illustrated in Fig. 6.1 on the next page.

[Fig. 6.1]


Up to now we have not really specified what is meant by the
task-stages we have introduced. We will briefly indicate the intentions
in this respect before fuller consideration.

(a)  project  initiation  -  determination  of  frame  of  reference  of 
project  and  also  of  organisation  structure. 

(b)  application  requirements  -  obtaining  specifications  of  what  the 
application  requires  but  not  how  it  should  be  provided. 

(c)  design  -  obtaining  specifications  of  how  the  application  require¬ 
ments  are  to  be  provided. 

(d)  program  -  obtaining  computer  programs  corresponding  to  the  design 
specifications . 

(e)  test  -  testing  the  computer  programs  both  for  correctness  and 
for  performance. 

(f) evaluation - evaluating the program tests for the purpose of
acceptance  or  "rejection  for  modification". 
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(g)  initial  op.  ation  -  the  initial  period  of  phasing-in  the  system 
in  the  real  environment . 

(h)  maintenance  -  the  upkeep  of  the  system  throughout  its  usage  life. 

6.3  Project  Initiation  Stage 

This  is  perhaps  the  most  important  phase  of  the  project  in  the  sense 
that  bad  organisation  here  will  wreck  the  project  before  it  is  really 
started.  Also  this  is  probably  the  only  stage  at  which  certain  personnel 
problems can be dealt with effectively. Let us look at some of the
decisions which must be made here.

1.  Initial  Personnel  Assignment  -  A  few  critical  persons  should  be 
assigned, including application representatives and design representatives.
This organisation committee should have a Chairman who takes
responsibility  for  the  project  and  who  interacts  with  the  external 
management.  The  task  of  the  committee  is  to  determine  project 
organisation  and  generally  to  oversee  the  project  through  to  its 
completion. 

A  question  which  is  of  great  concern  regarding  the  appointment 
of  the  chairman  is  whether  he  is  from  the  applications  or  design 
side.  There  are  advantages  and  disadvantages  to  both. 

Since  the  system  is  for  the  application,  it  would  seem  that  an 
application  person  is  best.  On  the  other  hand  much  of  the  project 
will  be  concerned  with  designing.  Also  the  design  point  of  view 
may  be  cut  off  from  the  external  management  with  possible  dire 
consequences.

The  Author  favours  allowing  the  application  side  to  have  the 
chair  but  with  possibly  a  design  person  specified  to  report  to  the 
external management.

In  some  cases  there  will  be  no  real  choice  to  make  with  a 
particular  person  clearly  having  the  required  attributes  for  the 
chair  of  the  project  committee. 

2.  Personnel  Relations  -  At  this  stage  it  is  important  that  all  the 
persons  involved  settle  down  to  working  together.  If  obvious 
incompatibilities  arise,  these  should  lead  to  personnel  changes  if 
possible. Two important ingredients are respect and willingness
to work within the project committee framework. An important
factor  in  this  should  be  an  agreed  procedure  for  handling  disputes. 
This  should  include  the  possibility  of  either  agreement  to  "carry" 
disputes  for  later  settlement  or  a  mechanism  to  interact  with 
external  management  where  necessary.  These  mechanisms  should  be 
carried  over  to  later  project  task-stages. 


One of the problems of having persons with differing backgrounds
and interests in the committee is the non-communication
problem. Terminology is used in discussions which is either
meaningless to the other side or, worse, is understood to mean
something different. Rectifying discussion in many cases does
not really work, although appearing to do so. For this reason
it is important to put considerable effort into a dictionary of
terms (more haste, less speed). This can fit nicely into the
production of a conceptual model of the project area, which in
itself helps to clarify what is to be done.

3. Documentation - Decisions must be made on the degree of documentation
for the project, including specification of what documents
are needed and who is responsible for (a) accepting them, (b) updating
them, (c) discussing alterations to them. The Author believes
that documentation is the key to co-ordination and project control.

In the sections on the later task-stages a documentation system will
be presented (see also the suggestions in [53]). A key issue is
what can be documented in advance (so to speak) and what is best left
until after the event. It should be realised that documentation in
advance must imply a policy of documentation change whenever warranted
and also the existence of organised groups forming decision
centres for document change and project co-ordination. The
suggestion here is that, to some extent, documentation and personnel
are structured in parallel with the system structure.

The following four aims illustrate the possible power of documentation.

(a) to provide information to interested persons

(b) to act as working documents forming a basis for co-ordination
and decision

(c) to act as educational and reference material for the
design and implementation staff

(d) to serve as a basis for final documentation of the
system

4.  Project  Strategy  and  Constraints  -  This  is  concerned  with  the  global 
structure  of  the  project  and  how  it  is  affected  by  constraints.  The 
following  are  possible  structures: 

(a) single all-encompassing plan for single implementation

(b) series of planned implementations dealing with subsystems
and combining eventually to a total system

(c) series of implementations planned one at a time as
total computer systems but increasing in capability
and sophistication.

Which  of  these  is  best  depends  both  on  the  problem  and  the  constraints. 
Plan (a) is most feasible where the problem is well understood and not
too large. Unfortunately the application environment may
make it very difficult to adopt any other plan, even where
these conditions do not hold. In some cases a
possibility is to carry a series of implementations through
to evaluation and only bring the last one into operation.
Note that valid testing is difficult under these conditions.

Plan (b) is most feasible where phasing in subsystems, with partly old
and partly new system operation, is desirable and possible.
However, some gain can result even when the series is put into
operation in a single step, since later subsystems benefit from
lessons learnt with the earlier ones.

Plan (c) is especially relevant where the problem being tackled is
pushing the state of the art and unknown factors of some
seriousness must be expected. An excellent comment on
experimental design appears on p.327 of [7].

So much for the project strategies. The constraints that
will affect the strategy and some of its details are concerned with
resource availability, where resource includes personnel, money, hardware,
computer time, etc. For example, the project may be constrained to use
certain hardware currently available in the parent organisation. At
this task-stage the constraints should be specified but the implications
by and large left for later task-stages.

5. Frame of Reference - The project must have a frame of reference
which consists basically of a brief indication of application
requirements. This, together with the constraints, should be
sufficient to decide on a project strategy and to enable an
application requirements stage to be initiated. This requires the
setting up of an application requirements group. It should be
realised that the tasks of the organisation committee do not
terminate at this stage. The committee must remain in existence to
oversee the project generally (settle disputes, etc.) and to make
organisational changes if necessary.

6.4  Application  Requirements  Stage 

The application requirements group should consist mainly of
application personnel. However, at least one design person must be included
(and possibly more). The major task of the design personnel at this
stage is to retain feasibility, although this must be carried out with
some care so as not to lock out real possibilities. One danger with
the group structured as above is the well-known problem that users do not
know what they want in many cases. They may, however, know what they
do not want (which may include all feasible systems). There is, thus,
a strong case for outside consultants to be involved in some way here (and
maybe at project initiation), thus bringing to the group a type of person
who has experience in evolving requirements satisfactory to application
personnel.

The basic output of this stage is the application requirements
document. It should describe what operations are required of the
system and also predict usage patterns such as volumes, frequencies, etc.
In many cases this can involve extensive statistics gathering of present
conditions. The tone of the document should be "what", not "how", and it could
be written in normal English with possibly some high-level flowcharts
and diagrams. A pictorial description of the logical system should be
included here if not already provided at project initiation.


6.5  Design  Stage 

This should be controlled by a small group of designers. An
applications person should be in the group as liaison with the application
environment (possibly seconded from the application requirements group). A
single person in the group must take responsibility for the stage. The
input document for the group is the application requirements document.
The output is a series of design working documents consisting of a master
design document, subsystem module design documents (one per functional
module if appropriate) and an operational support design document.

1.  Master  Design  Document  -  This  is  a  description  of  the  system  which 
is  to  be  implemented.  It  should  describe  the  total  system  in 
terms  of  subsystems  which  have  well-defined  positions  in  the  system. 
The  subsystem  interfaces  should  be  specified.  There  may  or  may  not 
be an equivalence between these subsystems and the application requirements
model mentioned earlier. In any event this document lays out
considerable  detail  as  compared  to  the  model. 

2. Subsystem Module Design Documents - Each of these documents describes
in detail how the corresponding subsystem is composed of
programmable units (e.g., subprograms and their relationships).
Described here are all subprograms which are not classified as
operational support (see later). Again all program interfaces
are detailed, e.g., names, input parameters, output parameters, etc.

3.  Operational  Support  Design  Document  -  This  document  describes  both 
the  basic  computer  system  support  and  also  the  extended  system 
support  software  which  is  utilised  to  support  the  system.  For 
example, utilising OS/360, an IBM 360/65 and the language PL/I could constitute
basic support. Storage devices also come under this
heading. It is important to delimit any options that may exist;
e.g., there may be several different storage devices that could
be used in practice.

The  extended  support  software  consists  of  routines  which  have 
to  be  programmed  in  the  implementation  and  correspond  to  basic 
tasks  shared  by  more  than  one  subsystem.  An  example  of  a  set  of 
such  tasks  could  be  a  set  of  character  or  string  manipulation 
routines.  Note  that  there  may  be  other  routines  shared  by  several 
subsystems  which  are  not  included  in  the  extended  support  module 
but  are  in  a  specially  designated  subsystem. 
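
The interface detail called for in the subsystem module design documents (names, input parameters, output parameters) can be held in a form that is checked mechanically as well as read. The following Python sketch is an illustration only and is not taken from the notes. The classes SubprogramInterface and ModuleDesign, the subprogram FETCH_RECORD and its parameter names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SubprogramInterface:
    name: str                                    # subprogram name
    inputs: list = field(default_factory=list)   # input parameter names
    outputs: list = field(default_factory=list)  # output parameter names


@dataclass
class ModuleDesign:
    subsystem: str
    interfaces: dict = field(default_factory=dict)

    def add(self, iface):
        """Record an interface; a duplicate name is treated as an error."""
        if iface.name in self.interfaces:
            raise ValueError("interface already specified: " + iface.name)
        self.interfaces[iface.name] = iface

    def check_call(self, name, supplied_inputs):
        """Return (missing, unexpected) input parameter names for a call."""
        iface = self.interfaces[name]
        missing = sorted(set(iface.inputs) - set(supplied_inputs))
        unexpected = sorted(set(supplied_inputs) - set(iface.inputs))
        return missing, unexpected


# Hypothetical entry in a module design document.
doc = ModuleDesign("file access subsystem")
doc.add(SubprogramInterface("FETCH_RECORD",
                            inputs=["file_name", "key"],
                            outputs=["record", "status"]))
print(doc.check_call("FETCH_RECORD", ["key"]))
```

A record of this kind lets small groups work independently against the document specification, since a call that disagrees with the specified interface is reported rather than discovered at integration.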

This stage is an unusual one in that the methodology involved
tends to be a both-ends-towards-the-middle approach rather than a top-down
or bottom-up approach. The transition from logical to physical
facilities is made here and is carried out taking both the physical
support and the application requirements into account.

Some of the facilities which have to be chosen (if they are not
already constrained) are machine configuration, operating system, and
languages. A good principle is to use as high a level of existing
facilities as possible. This sometimes leads to interfacing difficulties
but is in many cases worth the effort. The usual properties
of portability, flexibility, generality, extendability and efficiency,
as well as ease of usage, are desirable but often conflicting.

A further responsibility which rests with the design group is to
try to schedule the implementation and also to predict performance and
lay down acceptance standards. Identification and isolation of critical
areas of the system is an important aspect of this task. We will
not consider these matters further other than to point out that they
are not easy tasks.

6.6  Programming,  Testing  and  Evaluating  (Implementing)  Stages

We will not consider these stages in detail other than to state
that the previous comments on documentation are hopefully a considerable
aid to co-ordination, enabling small groups to work independently
and interface via the document specifications. Note the implication
that programmers may bring about changes in their frame of reference
as well as programming within that reference. However, the changes
are controlled. Owing to the problems of document updating (delays,
etc.) it is probably a good rule that once a group of persons
below a certain size is assigned to an area of the system with a specified
interface, their own internal work is documented after the event
and not before. This should keep the number of documents from becoming
too large. Note that problems at this stage resulting in failure to
meet predictions may in some cases lead to changes in both the design
and application requirements documents.

System Monitoring should be considered an important factor at
this stage and in fact should have been planned in the design stage.
Statistics should be generated and considerable effort put into
obtaining good test data.

6.7  Initial  Operation  and  Maintenance  Stages 

Again we will not consider these stages in detail except to say
that this is what all the other stages were for - to produce a working
system. Undoubtedly problems will arise here that were not foreseen
and which require some if not all of the project organisational groups
to remedy. Hence the project organisation must remain in existence
for a considerable period after initial system operation. Eventually
the system will have settled down and should be assigned a maintenance
staff, including some of the project personnel, to keep things running
smoothly. Note that test data and tests should be retained (and
possibly more generated) for checking the system. Also monitoring
should not end at operation initiation time. We will terminate this
chapter with the following comment. No matter how much testing and
planning is carried out, the real live conditions of a working system
cannot be imitated fully. Problems will arise in operation and the
project organisation must take this important factor into account.